ABSTRACT


A  new  technique  for  co-channel  talker  interference  suppression  has  been  developed,  and  applied 
effectively  in  situations  where  the  speech  waveforms  from  both  the  desired  talker  and  the  interfering 
talker  are  vocalic.  The  technique  combines  a  minimum  mean-squared  error  estimation  procedure 
with a sinusoidal analysis/synthesis model of speech as a sum of sinusoids with time-varying
amplitudes, frequencies, and phases. For the received waveform, which is the additive combination
of two speech signals on a single channel, least-squared error estimates of the sinusoidal model
parameters  for  each  of  the  two  speech  signals  are  made.  A  synthesizer  based  on  the  sinusoidal 
model  is  used  to  reconstruct  the  speech  of  the  desired  talker. 

Initial  studies  of  the  feasibility  of  the  new  technique  have  examined  the  level  of  interference 
suppression attained when the least-squared error estimation was performed with a priori knowledge
of (1) all the sine-wave frequencies, and (2) the fundamental frequency of each speech
waveform prior to summation. In both cases, the sine-wave amplitudes and phases were unknown.
When the frequencies of both waveforms were obtained by peak-picking of the individual short-time
Fourier transforms prior to summation, the least-squared error strategy yielded good suppression
of the interfering speech and enhancement of the target speech over a wide range (9 to -16 dB)
of target-to-interferer ratios. When the individual fundamental frequency contours were provided,
the  enhancement  was  only  slightly  degraded.  For  both  cases,  the  performance  was  significantly 
improved  by  a  multi-frame  interpolation  technique  which  predicts  the  time  evolution  of  the 
sinusoidal  parameters  across  frames  where  the  frequencies  of  the  two  waveforms  are  closely  spaced 
and,  therefore,  difficult  to  track. 

Finally,  the  least-squared-error  approach  was  tested  with  no  a  priori  information  provided  on 
either  of  the  two  waveforms.  The  least-squared  error  criterion  was  extended  to  estimate  both 
fundamental  frequency  contours  from  the  summed  waveform,  and  then  applied  further  to  estimate 
the remaining sinusoidal parameters. This technique was demonstrated to provide useful interference
suppression when the summed vocalic speech waveforms have equal intensities and smooth,
nonintersecting  fundamental  frequency  contours. 

The results obtained, though limited in their scope, provide evidence that the combination of the
sinusoidal analysis/synthesis model with effective parameter estimation techniques offers a promising
approach to the currently unsolved problem of co-channel talker interference suppression over
a range of conditions. Promising areas for further investigation are identified.
AN APPROACH TO CO-CHANNEL TALKER INTERFERENCE
SUPPRESSION  USING  A  SINUSOIDAL  MODEL  FOR  SPEECH 


1.  INTRODUCTION 

In  a  number  of  important  applications,  it  is  desirable  to  suppress  an  interfering  waveform 
which  degrades  a  desired  signal.  When  the  desired  signal  and  the  interfering  signal  are  additively 
combined  speech  waveforms,  the  goal  is  to  enhance  the  intelligibility  of  a  target  speaker.  This 
problem  is  often  referred  to  as  co-channel  talker  interference  suppression.  The  interfering  speech 
may  have  been  introduced  in  a  microphone  environment  or  may  have  resulted  from  cross  talk  in 
a  neighboring  communications  channel.* 

This  report  describes  a  new  approach  to  co-channel  talker  interference  suppression  based  on 
a  sinusoidal  representation  of  speech.  The  technique  fits  a  sinusoidal  model  to  additive  vocalic 
speech  segments  such  that  the  least  mean  squared  error  between  the  model  and  the  combined 
waveforms  is  obtained.  Enhancement  is  achieved  by  synthesizing  a  waveform  from  the  sine  waves 
attributed to the desired speaker. Least squares estimation is applied to obtain sine-wave amplitudes
and phases of both talkers, based on either a priori sine-wave frequencies or a priori fundamental
frequency contours. When the frequencies of the two waveforms are closely spaced, the
least  squares  approach  can  have  difficulty  in  tracking  the  sine-wave  parameters.  In  these  cases, 
the  performance  is  significantly  improved  by  an  interpolation  technique  which  predicts  the  time 
evolution  of  the  sinusoidal  parameters  across  multiple  analysis  frames.  The  least-squared  error 
approach  is  also  extended  to  estimate  fundamental  frequency  contours  of  both  speakers  from  the 
summed  waveform,  and  is  applied  further  to  estimate  the  remaining  sinusoidal  parameters. 

Numerous  other  methods  have  been  proposed  for  vocalic  speech  separation.  One  approach 
relies on the short-time Fourier transform (STFT) of the target speech having its energy focused
in regions about multiples of the target speaker’s fundamental frequency. Intelligibility improvements
are attempted by comb filtering the interfering speaker’s harmonics from the sum.^-^ A disadvantage
of comb filtering stems from the short duration over which speech remains stationary
with respect to a periodicity assumption. The duration of the comb filter’s impulse response must
be made correspondingly short. This constraint prevents the separation of closely spaced harmonics.
One method which explicitly attempts to resolve closely spaced harmonics was introduced by
Parsons^’^  who  exploited  the  shape  of  the  Fourier  transform  of  the  time-domain  window  used  in 
computing  the  STFT.  Another  class  of  methods,  harmonic  magnitude  suppression  (HMS),  was 
introduced  by  Hanson  and  Wong*  and  further  developed  by  Naylor  and  Boll,^’^  and  Childers 
and  Lee.***  In  this  approach,  the  short  time  Fourier  transform  magnitude  (STFTM)  of  the 
summed  speech  is  sampled  at  the  harmonics  of  the  interferer  which  is  assumed  to  be  much  larger 
(e.g.,  6  to  16  dB  larger)  than  the  target  speech.  These  samples  are  used  to  obtain  an  STFTM 
estimate  of  the  interfering  speech  which  is  then  subtracted  from  the  STFTM  of  the  sum  to  yield 
an  estimate  of  the  target  spectral  magnitude.  The  phase  from  the  original  summed  waveform  is 
used  to  supply  the  phase  for  the  estimate  of  the  target  utterance. 
In addition to a new framework for the talker interference suppression problem, the sinusoidal
approach of this report differs from other methods in three important ways. First, the
sine-wave based method allows for closely spaced harmonics by modeling the linear dependence
of the STFT on sine-wave parameters for each speaker; features in the STFT are not relied on in
detecting the presence of closely spaced harmonics. Second, the sine-wave phases, and hence
the phase of the STFT of each speech waveform, are explicitly estimated. Finally, when
parameter separation is difficult in regions of closely spaced harmonics, the time evolution of
model parameters is exploited.

The outline of this report is as follows. Section 2 reviews speech analysis/synthesis based on
the sinusoidal model and gives the extension to the two-speaker case. Two candidate methods for
talker separation with this system, peak-picking and frequency-sampling, are described. The problem
of using these systems for talker separation with closely spaced frequencies motivates the new
least squares sine-wave parameter estimation methods described in Sections 4 and 5. In
Section 4, knowledge of the two underlying pitch periods, or more generally the sinusoidal frequencies,
is assumed. Under this condition, an equivalency is demonstrated between the least
squares solution and a simple frequency-domain solution which uses the linear dependence of the
STFT on sine-wave parameters. A method for estimating the pitch contours from the summed
speech waveforms is discussed in Section 5. The focus of Section 6 is the multi-frame interpolation
strategy for estimating the amplitudes and phases of sine waves that are not resolved by the
least squares estimation method. Section 7 gives the results of informal listening tests that illustrate
performance. Finally, Section 8 makes suggestions for further research.


2.  THE  TALKER  INTERFERENCE  SUPPRESSION  PROBLEM 
IN  THE  CONTEXT  OF  THE  SINE-WAVE  MODEL 

In this section, the sinusoidal model of speech is extended to the two-speaker case. Based on
this model, a speech analysis-synthesis system is described and two methods of interference suppression
are proposed. The frequency-domain properties of these approaches are discussed and
the problem of separating model parameters for closely spaced frequencies is described. The section
begins with a review of an analysis-synthesis system for the single-speaker case.


2.1  Speech  Analysis-Synthesis  Based  on  a  Sinusoidal  Model 


According  to  the  speech  production  mechanism,  speech  can  be  modeled  as  the  output  of  a 
slowly-varying  vocal  tract  filter  with  a  quasi  (i.e.,  “almost”)  periodic  excitation  input  during 
voiced  speech  and  a  noise-like  excitation  during  unvoiced  speech.**  Under  this  condition,  the 
speech waveform can be represented by a sum of sine waves with time-varying amplitudes, frequencies,
and phases.2,3

$$x(n) = \sum_{k=1}^{M} a_k(n) \cos[\theta_k(n)] \tag{2.1}$$

where the amplitudes and phases are denoted by $a_k(n)$ and $\theta_k(n)$, respectively, and the
time-varying frequency of each sine wave is given by the derivative of the phase and will be denoted
by $\omega_k(n) = \dot{\theta}_k(n)$. If the excitation is periodic, with a slowly-varying period, then the frequencies
can be represented by multiples of a slowly-varying fundamental frequency, $\omega_0(n)$. This harmonic
model and the more general model (2.1) will be used in the various systems throughout this
report. For the purpose of designing an analysis-synthesis system based on the model (2.1), we
simplify the phase function $\theta_k(n)$ by assuming a fixed frequency $\omega_k$ and fixed amplitude $a_k$ in a
time interval over which waveform analysis will take place (typically 20 to 30 ms). Under this
condition, the model in (2.1) is given by

$$x(n) = \sum_{k=1}^{M} a_k \cos[\omega_k n + \phi_k] \tag{2.2a}$$

where

$$\theta_k(n) = \omega_k n + \phi_k \tag{2.2b}$$

and where the phase value $\phi_k$ is the phase offset measured relative to the origin of the analysis
frame (i.e., n = 0).

An overview of an analysis-synthesis system based on this sinusoidal model (2.1) and (2.2) is
depicted in Figure 2-1 (References 2 and 3). A short-time Fourier transform (STFT) is computed
every 10 to 20 ms over a 20 to 30 ms analysis window using the discrete Fourier transform
(DFT). The frequencies are estimated by picking the peaks of the uniformly spaced samples of
the short-time Fourier transform magnitude (STFTM). Alternatively, the frequencies can be estimated
via an estimate of the fundamental frequency. The sine-wave amplitudes $a_k$ and phases $\phi_k$
for each analysis frame are then given by the amplitude and phase of the STFT at the measured
frequencies.

Figure 2-1. Sinusoidal analysis-synthesis.
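
As an illustration of the analysis stage, the following Python sketch estimates sine-wave frequencies, amplitudes, and phases by peak-picking a directly evaluated STFT magnitude. It is a minimal sketch and not the system of Figure 2-1: the window scaling W(0) = 1, the grid size `n_bins`, the side-lobe threshold `floor`, and the two-component test frame are all assumptions introduced here.

```python
import cmath
import math

def hann_window(N):
    """Symmetric Hann window on n = -(N-1)/2 .. (N-1)/2 (N odd), scaled so sum(w) = 1."""
    half = (N - 1) // 2
    w = [0.5 * (1.0 + math.cos(2.0 * math.pi * n / (N + 1))) for n in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def stft_sample(x, w, omega):
    """S(omega) = sum_n w(n) x(n) exp(-j omega n), with n = 0 at the frame center."""
    half = (len(w) - 1) // 2
    return sum(w[i] * x[i] * cmath.exp(-1j * omega * (i - half)) for i in range(len(w)))

def pick_peaks(x, w, n_bins=1024, floor=0.1):
    """Peak-pick the STFT magnitude on a uniform grid over [0, pi].

    Returns (frequency, amplitude, phase) triples. `floor` is an assumption of
    this sketch: peaks below floor * (largest magnitude) are treated as window
    side lobes and discarded.
    """
    omegas = [math.pi * k / n_bins for k in range(n_bins + 1)]
    S = [stft_sample(x, w, om) for om in omegas]
    mag = [abs(s) for s in S]
    thresh = floor * max(mag)
    peaks = []
    for k in range(1, n_bins):
        if mag[k] >= thresh and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            # With W(0) = 1, a sine wave of amplitude a gives |S| close to a/2 at its peak.
            peaks.append((omegas[k], 2.0 * mag[k], cmath.phase(S[k])))
    return peaks

# Hypothetical test frame: two well-separated stationary sine waves, as in (2.2a).
N = 201
half = (N - 1) // 2
true_params = [(0.30, 1.0, 0.5), (1.00, 0.5, -1.0)]  # (omega, amplitude, phase)
x = [sum(a * math.cos(om * n + ph) for om, a, ph in true_params)
     for n in range(-half, half + 1)]
peaks = pick_peaks(x, hann_window(N))
for (om, a, ph), (om_t, a_t, ph_t) in zip(peaks, true_params):
    print(f"omega {om:.3f} (true {om_t}), amplitude {a:.3f} (true {a_t}), "
          f"phase {ph:.3f} (true {ph_t})")
```

Because the window is symmetric about n = 0, its transform is real, so the STFT phase at a peak can be read directly as the phase offset of (2.2b).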

The first step in the synthesis requires association of the frequencies measured on one frame
with those obtained on a successive frame. This is accomplished with a nearest-neighbor matching
algorithm which incorporates a birth/death process of the component sine waves.2 Amplitude and
phase parameters are then interpolated across frame boundaries at these matched frequency sets.
Since the amplitudes are slowly varying, it suffices to interpolate them linearly. In interpolating
the phase, since the phase is measured modulo $2\pi$, phase unwrapping must be performed. In
addition, since the phase is the integral of the instantaneous frequency, the interpolation must
yield a phase which is consistent with the frequencies measured at each frame boundary. To solve
this problem, a cubic polynomial is used for the interpolation function. The solution requires
constraining the cubic function and its derivative to equal the measured phases and frequencies,
respectively, at the frame boundaries, and to satisfy a smoothness constraint across each frame.
When the sine-wave frequencies are measured via peak picking, the waveform estimate, $\hat{x}(n)$,
obtained by summing the amplitude-modulated sine waves, is perceptually nearly indistinguishable
from the original.
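
The cubic phase interpolation can be sketched as follows. The report does not state the closed form at this point, so the expressions below are an assumption of this sketch: they use the commonly cited "maximally smooth" choice of the unwrapping integer M for a cubic that matches the phase (modulo 2π) and frequency at both frame boundaries. The function names and the example values are hypothetical.

```python
import math

def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Coefficients of theta(t) = theta0 + omega0*t + alpha*t^2 + beta*t^3 on [0, T].

    The cubic satisfies theta(0) = theta0, theta'(0) = omega0,
    theta(T) = theta1 + 2*pi*M and theta'(T) = omega1. The integer M is the
    phase-unwrapping term; it is chosen here by the assumed smoothness rule.
    """
    M = round(((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0)
              / (2.0 * math.pi))
    d = theta1 - theta0 - omega0 * T + 2.0 * math.pi * M
    alpha = 3.0 * d / T**2 - (omega1 - omega0) / T
    beta = -2.0 * d / T**3 + (omega1 - omega0) / T**2
    return alpha, beta, M

def interp_phase(theta0, omega0, alpha, beta, t):
    """Interpolated phase at time t samples into the frame."""
    return theta0 + omega0 * t + alpha * t**2 + beta * t**3

def interp_freq(omega0, alpha, beta, t):
    """Instantaneous frequency, i.e. the derivative of the interpolated phase."""
    return omega0 + 2.0 * alpha * t + 3.0 * beta * t**2

def interp_amp(a0, a1, t, T):
    """Linear amplitude interpolation between frame boundaries."""
    return a0 + (a1 - a0) * t / T

# Hypothetical matched sine wave: measurements at two successive frame boundaries.
theta0, omega0 = 0.3, 0.20    # phase (rad) and frequency (rad/sample) at frame start
theta1, omega1 = -1.1, 0.22   # the same quantities one frame later
T = 100.0                     # frame spacing in samples
alpha, beta, M = cubic_phase_coeffs(theta0, omega0, theta1, omega1, T)
end_phase = interp_phase(theta0, omega0, alpha, beta, T)
end_freq = interp_freq(omega0, alpha, beta, T)
```

A synthesizer would evaluate `interp_amp(...) * math.cos(interp_phase(...))` for each matched sine wave and sum the results over all tracks.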

2.2  The  Two-Speaker  Case 

The  sinusoidal  speech  model  for  the  single-speaker  case  is  easily  generalized  to  the  two- 
speaker  case.  A  speech  waveform  generated  by  two  simultaneous  talkers  can  be  represented  by  a 
sum  of  two  sets  of  sine  waves  each  with  time-varying  amplitudes,  frequencies,  and  phases 

$$\begin{aligned}
x(n) &= x_a(n) + x_b(n) \\
x_a(n) &= \sum_{k=1}^{M_a} a_k(n) \cos[\theta_{a,k}(n)] \\
x_b(n) &= \sum_{k=1}^{M_b} b_k(n) \cos[\theta_{b,k}(n)]
\end{aligned} \tag{2.3}$$

where the sequences $x_a(n)$ and $x_b(n)$ denote the speech of speaker A and the speech of speaker B,
respectively. The amplitudes and phases associated with speaker A are denoted by $a_k(n)$ and
$\theta_{a,k}(n)$, and the frequencies are given by $\omega_{a,k}(n) = \dot{\theta}_{a,k}(n)$. A similar parameter set is associated
with speaker B. If the excitation is periodic, a two-speaker harmonic model can be used where
the frequencies associated with speaker A and speaker B are multiples of two underlying fundamental
frequencies, $\omega_a(n)$ and $\omega_b(n)$, respectively. In the steady-state case where the vocal cords
and vocal tract characteristics are assumed fixed over the analysis time interval, we can write the
model of (2.3) as
$$\begin{aligned}
x(n) &= x_a(n) + x_b(n) \\
x_a(n) &= \sum_{k=1}^{M_a} a_k \cos(\omega_{a,k} n + \phi_{a,k}) \\
x_b(n) &= \sum_{k=1}^{M_b} b_k \cos(\omega_{b,k} n + \phi_{b,k})
\end{aligned} \tag{2.4}$$

which  is  a  useful  model  on  which  to  base  sine-wave  analysis. 

Using  the  model  (2.3)  and  (2.4),  it  is  possible,  as  in  the  single  speaker  case,  to  reconstruct 
the  two-speaker  waveform  with  the  analysis-synthesis  system  illustrated  in  Figure  2-1.  In  order  to 
obtain  an  accurate  representation  of  the  waveform,  the  number  of  sine  waves  in  the  underlying 
model  is  chosen  to  account  for  the  presence  of  two  speakers.  The  presence  of  two  speakers  also 
requires  that  the  analysis  window  length  be  chosen  to  resolve  frequencies  more  closely  spaced 
than  in  the  single-speaker  case.  Due  to  the  requirement  of  time  resolution,  however,  the  analysis 
window  length  was  chosen  to  give  adequate  frequency  resolution  for  the  lower-pitch  speaker.  The 
reconstruction  yields  synthetic  speech  that  is  again  nearly  indistinguishable  from  the  original 
summed  waveform. 

The ability to recover the summed waveform via the analysis-synthesis system of Figure 2-1
suggests the method of peak picking in Figure 2-2 for recovering a desired waveform $x_b(n)$ which
is of lower intensity than an interfering background talker $x_a(n)$. The largest peaks of the
summed spectra (the number of peaks is equal to or less than the number required to represent a
single waveform) are chosen and are used to reconstruct the larger of the two waveforms. This
waveform estimate is then subtracted from the combined waveform to form an estimate of the
lower signal. The largest peaks of the summed spectra, however, do not necessarily represent the
peaks of the spectra of the larger waveform; that is, they will in general contain information
about both waveforms. The parameters which form the basis for the reconstruction of the
summed waveforms do not necessarily form the basis for reconstructing the individual speech
waveforms.

An  example  is  illustrated  in  Figure  2-3  which  shows  the  sine-wave  frequency  tracks  derived 
from  combined  voiced  segments  of  male  and  female  speech.  Since  the  female  speaker  is  18  dB 
above  the  male  speaker,  we  expect  the  larger  peaks  to  represent  the  peaks  of  the  female  speaker. 
In  fact,  as  shown  in  Figure  2-3(c),  the  frequency  tracks  derived  from  the  largest  peaks  appear  to 
correspond  to  primarily  the  tracks  of  the  larger  speaker.  The  waveform  reconstructed  using  these 
frequency  tracks  and  corresponding  amplitudes  and  phases  of  the  STFT,  however,  manifests  the 
lower  speaker  almost  as  clearly  as  in  the  original  summed  waveform.  The  attempted  recovery  of 
the  larger  speaker  with  only  25,  10,  and  even  5  of  the  largest  peaks  still  allows  the  lower  speaker 
to be heard in the synthesis. Performing the subtraction indicated in Figure 2-2 results in a
“garbled” reconstruction with no apparent enhancement of $x_b(n)$. These results held for summed
all-voiced passages, as well as summed voiced and unvoiced passages. One problem with this
technique, described in the following section, is that closely spaced frequencies associated with
different speakers may be seen as one peak by the peak-picking process. When the analysis window
length is made very large, e.g., 50 ms, thus improving frequency resolution, the lower
speaker is further suppressed in the reconstruction of $\hat{x}_a(n)$. Even in this case, however, inadequate
time resolution, as well as closely spaced frequencies, prevent effective enhancement by
the interference suppression system of Figure 2-2.

Figure 2-2. Sinusoidal reconstruction of speech from the summed waveform.

Another approach to separating two summed passages via sine-wave analysis-synthesis is
illustrated in Figure 2-4. Knowledge of the sine-wave frequencies of each of the two speakers is
assumed; the frequency sets are obtained by peak-picking individual STFTMs and then are used
to sample the summed STFT to obtain amplitude and phase estimates for the sine-wave representation
of each waveform. An estimate of the desired lower waveform, $x_b(n)$, could then be
directly reconstructed or, as illustrated in Figure 2-4, an estimate, $\hat{x}_b(n)$, can be obtained by subtracting
the reconstructed larger waveform from the summed waveform. (The linearity of the
STFT operator makes these two estimates essentially equivalent.) We refer to this approach as
frequency sampling since the summed STFT is sampled at the known frequencies. Alternatively,
the frequency sets might be obtained by estimating a fundamental frequency for each speaker.
This  method  is  akin  to  comb  filtering  which  extracts  a  waveform  by  processing  the  sum  with  a 
filter  derived  by  placing  its  resonances  about  multiples  of  an  assumed  fundamental  frequency.'*’^ 
Although these methods use more accurate frequency estimates than those obtained by peak-picking the
summed STFTM, the accuracy of the corresponding amplitudes and phases is limited, as before,
by the tendency of the frequencies of the two waveforms to be closely spaced. As we will see in
Section 6, the enhancement obtained by the method of frequency sampling is therefore small, in
spite of the large amount of a priori information required by this method.

2.3 The Problem of Closely Spaced Frequencies

In the last section we saw that although the summed waveform $x(n) = x_a(n) + x_b(n)$ is well
represented by peaks in the STFT of x(n), the sine-wave amplitudes and phases of the individual
waveforms are not easily extracted from these values. In this section we investigate the problem
of extracting the sine-wave amplitudes and phases of $x_a(n)$ and $x_b(n)$ from the STFT of x(n).

Figure 2-4. Reconstruction by frequency sampling.

Let $s_p(n)$ represent the pth windowed speech segment extracted from a time-shifted version
of the sum of two sequences

$$s_p(n) = w(n)[x_a(n + pL) + x_b(n + pL)] ; \quad -\frac{N-1}{2} \le n \le \frac{N-1}{2} \tag{2.5}$$

where L is the time shift between segments and where w(n) is nonzero over the interval
$-(N-1)/2 \le n \le (N-1)/2$, and, in this study, is given by the Hanning window.*^ The short-time
Fourier transform (STFT) of the summed waveforms, $S_p(\omega)$, is given by

$$S_p(\omega) = \sum_{n=-(N-1)/2}^{(N-1)/2} s_p(n) e^{-j\omega n} \tag{2.6}$$

which, in practice, is computed at uniformly spaced samples with the fast Fourier transform
(FFT).i^ By substituting (2.4) and (2.5) into (2.6), Equation (2.6) can be written as a summation
of scaled and shifted versions of the transform of the analysis window.


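$$\begin{aligned}
S_p(\omega) = {} & \sum_{k=1}^{M_a} \tfrac{1}{2} a_k \exp(j\phi_{a,k}) W(\omega - \omega_{a,k}) + \sum_{k=1}^{M_a} \tfrac{1}{2} a_k \exp(-j\phi_{a,k}) W(\omega + \omega_{a,k}) \\
& + \sum_{k=1}^{M_b} \tfrac{1}{2} b_k \exp(j\phi_{b,k}) W(\omega - \omega_{b,k}) + \sum_{k=1}^{M_b} \tfrac{1}{2} b_k \exp(-j\phi_{b,k}) W(\omega + \omega_{b,k})
\end{aligned} \tag{2.7}$$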

where $W(\omega)$ denotes the Fourier transform of the analysis window w(n) and where for simplicity
we assume that the time shift of the analysis frame in (2.5) is zero.
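
The structure of (2.7) can be checked numerically for a single sine wave. The sketch below compares a directly computed STFT sample with the sum of the two scaled and shifted window transforms. The window length, the signal parameters, and the evaluation frequency are hypothetical values chosen for illustration, and the window is scaled so that W(0) = 1.

```python
import cmath
import math

def hann_window(N):
    """Symmetric Hann window on n = -(N-1)/2 .. (N-1)/2 (N odd), scaled so sum(w) = 1."""
    half = (N - 1) // 2
    w = [0.5 * (1.0 + math.cos(2.0 * math.pi * n / (N + 1))) for n in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def window_transform(w, omega):
    """W(omega) for a window symmetric about n = 0; the transform is real."""
    half = (len(w) - 1) // 2
    return sum(w[i] * math.cos(omega * (i - half)) for i in range(len(w)))

def stft_sample(x, w, omega):
    """S(omega) = sum_n w(n) x(n) exp(-j omega n), with n = 0 at the frame center."""
    half = (len(w) - 1) // 2
    return sum(w[i] * x[i] * cmath.exp(-1j * omega * (i - half)) for i in range(len(w)))

N = 201
half = (N - 1) // 2
w = hann_window(N)

# One stationary sine wave a*cos(omega_1*n + phi) (hypothetical values).
a, omega_1, phi = 0.8, 0.45, 1.2
x = [a * math.cos(omega_1 * n + phi) for n in range(-half, half + 1)]

omega = 0.47  # evaluation frequency, slightly off the sine-wave frequency
lhs = stft_sample(x, w, omega)
rhs = (0.5 * a * cmath.exp(1j * phi) * window_transform(w, omega - omega_1)
       + 0.5 * a * cmath.exp(-1j * phi) * window_transform(w, omega + omega_1))
print(abs(lhs - rhs))  # agreement to within floating-point rounding
```

For two talkers the same identity holds term by term, which is the linear dependence that the least squares formulation relies on.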


The success of extracting sine-wave parameters by peak-picking, described in Section 2.2,
depends on the properties of $W(\omega)$, the Fourier transform of the analysis window.
The effective bandwidth of $W(\omega)$ is inversely proportional to N, the duration of the analysis window.
Longer window lengths give rise to narrower spectral main lobes. If the spacing between
the shifted versions of $W(\omega)$ in (2.7) is such that the main lobes do not overlap, a reasonable
strategy for extracting the model frequencies and performing the separation is the method of
peak-picking. For the case of summed speech waveforms, however, this constraint is not often
met since the analysis window cannot be made arbitrarily large. Even when the frequencies are
known a priori, i.e., when the method of frequency sampling is used, accurate estimates of the
sine-wave amplitudes and phases are generally not obtained if the frequencies are closely spaced.
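
The limitation of frequency sampling can be seen in a small numerical sketch. Two sine waves, one per talker, are placed closer together than the main-lobe width of the window transform, and the summed STFT is sampled at the exactly known frequency of talker A. All numerical values below are hypothetical and chosen only to make the main lobes overlap.

```python
import cmath
import math

def hann_window(N):
    """Symmetric Hann window on n = -(N-1)/2 .. (N-1)/2 (N odd), scaled so sum(w) = 1."""
    half = (N - 1) // 2
    w = [0.5 * (1.0 + math.cos(2.0 * math.pi * n / (N + 1))) for n in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def stft_sample(x, w, omega):
    """S(omega) = sum_n w(n) x(n) exp(-j omega n), with n = 0 at the frame center."""
    half = (len(w) - 1) // 2
    return sum(w[i] * x[i] * cmath.exp(-1j * omega * (i - half)) for i in range(len(w)))

N = 201
half = (N - 1) // 2
w = hann_window(N)

# One sine wave per talker; the 0.02 rad/sample spacing is well inside the
# main lobe of this window (first null near 4*pi/(N+1), about 0.062).
omega_a, a_true, phi_a = 0.50, 1.0, 0.0
omega_b, b_true, phi_b = 0.52, 1.0, 0.5
x = [a_true * math.cos(omega_a * n + phi_a) + b_true * math.cos(omega_b * n + phi_b)
     for n in range(-half, half + 1)]

# Frequency sampling: read amplitude and phase straight from the summed STFT.
S_at_a = stft_sample(x, w, omega_a)
a_est = 2.0 * abs(S_at_a)
phi_a_est = cmath.phase(S_at_a)
print(f"amplitude estimate {a_est:.3f} (true {a_true})")
print(f"phase estimate {phi_a_est:.3f} (true {phi_a})")
```

The estimate for talker A is contaminated by the overlapping lobe of talker B, even though the frequency itself is known exactly.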


Figure  2-5  illustrates  an  example  where  the  frequencies  are  spaced  closely  enough  to  prevent 
accurate  separation  by  the  above  methods.  Figures  2-5(a)  and  2-5(b)  depict  the  STFTM  of  two 
steady-state  vowels  over  a  25  ms  interval.  The  vowels  have  roughly  equal  intensity  and  belong  to 
two speakers who have dissimilar fundamental frequencies. The STFTM of the summed waveforms
appears in Figure 2-5(c). A subset of the main lobes of the Fourier transform of the analysis
window overlap and add such that they merge to form a single composite lobe (since the
addition  is  complex,  lobes  may  destructively  interfere  as  well).  When  the  peak-picking  strategy  is 
applied  to  the  STFTM  of  the  summed  speech  waveform,  the  process  may  allot  a  single  frequency 
to  represent  these  composite  structures.  For  this  reason,  the  frequency  sampling  strategy  will  also 
have  difficulty  in  recovering  the  individual  sine-wave  amplitude  and  phase  parameters. 

One approach to extracting the underlying amplitude and phase of the STFT of $x_a(n)$ and
$x_b(n)$ is to detect the presence of overlap and then use the structure of the analysis window in the
frequency domain to help in the separation.^>^ Figure 2-5 shows that “features” in the STFTM of
x(n) are not, however, reliable in detecting the presence of a single composite lobe formed by two
overlapping lobes. Unique characteristics in the phase of the STFT (depicted by dotted lines in
Figure 2-5) of overlapping lobes are also difficult to determine. For example, the two largest
lobes in the summed spectra are characterized by both magnitude symmetry and a flat phase
characteristic which characterizes either speaker A or speaker B. Thus any technique for separation
relying on such features will be prone to error.
Figure 2-5. Properties of the STFT of $x(n) = x_a(n) + x_b(n)$.
(a) STFT magnitude and phase of $x_a(n)$.
(b) STFT magnitude and phase of $x_b(n)$.
(c) STFT magnitude and phase of $x(n) = x_a(n) + x_b(n)$.

2.4 The Least Squares Approach

The  discussion  of  the  previous  section  suggests  that  the  linear  combination  of  the  shifted  and 
scaled  Fourier  transforms  of  the  analysis  window  in  (2.7)  must  be  explicitly  accounted  for  in 
achieving  separation.  The  (complex)  scale  factor  applied  to  each  such  transform  corresponds  to 
the  desired  sine-wave  amplitude  and  phase,  and  the  location  of  each  transform  is  the  desired  sine- 
wave  frequency.  Parameter  estimation  is  difficult,  however,  due  to  the  nonlinear  dependence  of 
the  sine-wave  representation  on  phase  and  frequency. 

The  approach  to  separation  in  this  report  first  assumes  a  priori  frequency  knowledge,  and 
performs  a  least  squares  fit  to  the  summed  waveform  with  respect  to  the  unknown  sine-wave 
amplitude  and  phase  parameters.  In  the  next  section  we  show  that  this  solution  is  equivalent  to 
solving  for  the  sine-wave  amplitudes  and  phases  via  the  linear  relationships  suggested  by  (2.7). 
Figure  2-6  puts  this  least  squares  approach  in  perspective  with  other  strategies  we  have  discussed 
and  which  we  will  further  compare  in  the  sequel.  In  contrast  to  the  frequency  sampling  method 
which  samples  the  STFT  at  the  known  frequencies,  in  one  implementation  of  the  least  squares 
method,  the  frequency  sampling  solution  is  used  as  an  initial  guess  and  iterated  upon  to  form 
more  accurate  estimates.  Figure  2-6  shows  that  the  frequencies  themselves  may  also  be  estimated 
via a least squares formulation. In Section 4, this estimation problem will be simplified by
constraining the frequencies to be harmonically related.
Figure 2-6. The least squares approach.


3.  THE  LEAST  SQUARES  SOLUTION 


In  this  section,  we  transform  the  nonlinear  problem  of  forming  a  least  squares  solution  for 
the  sine-wave  amplitudes,  phases,  and  frequencies  into  a  linear  problem.  We  accomplish  this  by 
assuming  the  sine-wave  frequencies  are  known  a  priori,  and  by  solving  for  the  real  and  imaginary 
components  of  the  quadrature  representation  of  the  sine  waves,  rather  than  solving  for  the  sine- 
wave  amplitudes  and  phases.  The  previous  section  suggests  that  these  parameters  can  be  obtained 
by  exploiting  the  linear  dependence  of  the  STFT  on  scaled  and  shifted  versions  of  the  Fourier 
transform of the analysis window. We begin this section with a solution based on this observation,
and then show that the parameters derived by this approach represent the sine-wave
parameters  chosen  by  forming  a  least  squares  fit  to  the  summed  speech  waveforms. 


3.1 Solving for the Sine-Wave Parameters


Figure 3-1 illustrates how the main lobes of two shifted versions of the Fourier transform of
the analysis window, $W(\omega)$, typically overlap when they are centered at two closely spaced frequencies
$\omega_1$ and $\omega_2$, corresponding to speaker A and speaker B, respectively, each consisting of a
single frequency. Figure 3-1 suggests a strategy for separation by solving the following linear
equations

$$\begin{bmatrix} 1 & W(\Delta\omega) \\ W(\Delta\omega) & 1 \end{bmatrix} \begin{bmatrix} S_a(\omega_1) \\ S_b(\omega_2) \end{bmatrix} = \begin{bmatrix} S(\omega_1) \\ S(\omega_2) \end{bmatrix} \tag{3.1}$$

where $S_a(\omega_1)$ and $S_b(\omega_2)$ denote the samples of the STFTs at known frequencies $\omega_1$ and $\omega_2$, and
$\Delta\omega$ is the distance in frequency between them. The amplitudes and phases of $S_a(\omega_1)$ and $S_b(\omega_2)$
represent the unknown parameters of the two underlying sine waves. The STFT of the sum is
denoted by $S(\omega)$ [for simplicity the subscript “p” in (2.7) has been removed]. The Fourier transform
of the analysis window is denoted by $W(\omega)$ with normalization W(0) = 1. Since the window
transform is real, the matrix in the left side of (3.1) is real; however, the STFT of the waveform
is complex, so that the complex solution to (3.1) can be obtained by solving separately the real
and imaginary parts of the matrix equation. Equation (3.1) is not exact since the contribution
from the Fourier transforms of the analysis window centered at $-\omega_1$ and $-\omega_2$ has not been
included (in practice, the signal to be transformed is real and so both positive and negative frequency
contributions will exist). For simplicity, we assume this contribution is negligible.
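
A minimal numerical sketch of (3.1) follows. It solves the 2 x 2 system in closed form for two closely spaced sine waves at known frequencies and converts the solution to amplitude and phase. The frame length, frequencies, amplitudes, and phases are hypothetical, and the factor of 2 in the amplitude follows from the 1/2 scaling in (2.7) with W(0) = 1.

```python
import cmath
import math

def hann_window(N):
    """Symmetric Hann window on n = -(N-1)/2 .. (N-1)/2 (N odd), scaled so W(0) = 1."""
    half = (N - 1) // 2
    w = [0.5 * (1.0 + math.cos(2.0 * math.pi * n / (N + 1))) for n in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def window_transform(w, omega):
    """W(omega) for a window symmetric about n = 0; the transform is real and even."""
    half = (len(w) - 1) // 2
    return sum(w[i] * math.cos(omega * (i - half)) for i in range(len(w)))

def stft_sample(x, w, omega):
    """S(omega) = sum_n w(n) x(n) exp(-j omega n), with n = 0 at the frame center."""
    half = (len(w) - 1) // 2
    return sum(w[i] * x[i] * cmath.exp(-1j * omega * (i - half)) for i in range(len(w)))

N = 201
half = (N - 1) // 2
w = hann_window(N)

# One sine wave per talker with overlapping main lobes (hypothetical values).
omega_1, a_true, phi_a = 0.50, 1.0, 0.0
omega_2, b_true, phi_b = 0.52, 1.0, 0.5
x = [a_true * math.cos(omega_1 * n + phi_a) + b_true * math.cos(omega_2 * n + phi_b)
     for n in range(-half, half + 1)]

# Right-hand side of (3.1): the summed STFT sampled at the known frequencies.
S1 = stft_sample(x, w, omega_1)
S2 = stft_sample(x, w, omega_2)

# Solve [[1, Wd], [Wd, 1]] [Sa, Sb]^T = [S1, S2]^T by Cramer's rule.
Wd = window_transform(w, omega_2 - omega_1)
det = 1.0 - Wd * Wd
Sa = (S1 - Wd * S2) / det
Sb = (S2 - Wd * S1) / det

a_est, phi_a_est = 2.0 * abs(Sa), cmath.phase(Sa)
b_est, phi_b_est = 2.0 * abs(Sb), cmath.phase(Sb)
print(f"talker A: amplitude {a_est:.4f}, phase {phi_a_est:.4f}")
print(f"talker B: amplitude {b_est:.4f}, phase {phi_b_est:.4f}")
```

The remaining error comes from the neglected negative-frequency terms, which are small here because the window side lobes are low at those offsets. As the two frequencies approach each other, `det` approaches zero and the system becomes ill-conditioned.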

Since from (2.7) the STFT of a sum of windowed sinusoids is a sum of shifted and scaled
versions of $W(\omega)$, the two-lobe case of Figure 3-1 can be simply extended to the case where there
are M overlapping lobes of the form in Figure 3-1. Specifically, a relation can be written which
reflects the linear dependence of the STFT on all M lobes (see Appendix A).

$$H\boldsymbol{\alpha} = 2\,\mathrm{Re}[\mathbf{S}(\boldsymbol{\omega})] \tag{3.2a}$$

$$H\boldsymbol{\beta} = -2\,\mathrm{Im}[\mathbf{S}(\boldsymbol{\omega})] \tag{3.2b}$$

where $\mathbf{S}(\boldsymbol{\omega})$, consisting of STFT samples, is a vector function of the sinusoidal frequency vector $\boldsymbol{\omega}$
given by
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ilt  =  ((U|,  a>2,  <03, . .  <wm)^  J  a»i  <  £U2  <  «>3  < 

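consisting of frequencies from both speaker A and speaker B, and where

$$
H = \begin{bmatrix}
W(0) & W(\omega_1 - \omega_2) & W(\omega_1 - \omega_3) & \cdots & W(\omega_1 - \omega_M) \\
W(\omega_2 - \omega_1) & W(0) & W(\omega_2 - \omega_3) & \cdots & W(\omega_2 - \omega_M) \\
W(\omega_3 - \omega_1) & W(\omega_3 - \omega_2) & W(0) & \cdots & W(\omega_3 - \omega_M) \\
W(\omega_4 - \omega_1) & W(\omega_4 - \omega_2) & W(\omega_4 - \omega_3) & \cdots & W(\omega_4 - \omega_M) \\
\vdots & \vdots & \vdots & & \vdots \\
W(\omega_M - \omega_1) & W(\omega_M - \omega_2) & W(\omega_M - \omega_3) & \cdots & W(0)
\end{bmatrix}
\tag{3.4}
$$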


Figure 3-1. Two overlapping main lobes of shifted and scaled versions of $W(\omega)$.


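The vectors $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ consist of estimates of the unknown parameters of (2.4) but in quadrature
form

$$\hat{x}(n) = \sum_{k=1}^{M} \hat{\alpha}_k \cos(\omega_k n) + \sum_{k=1}^{M} \hat{\beta}_k \sin(\omega_k n) \tag{3.5a}$$

with

$$\hat{\alpha}_k = \hat{a}_k \cos(\hat{\phi}_k) \tag{3.5b}$$

and

$$\hat{\beta}_k = -\hat{a}_k \sin(\hat{\phi}_k) \tag{3.5c}$$

where "^" denotes estimate and where $M = M_a + M_b$. Equation (3.5) can also be expressed in
terms of polar coordinates

$$\hat{x}(n) = \sum_{k=1}^{M} \hat{c}_k \cos(\omega_k n + \hat{\phi}_k) \tag{3.6}$$

$$\hat{c}_k = \sqrt{\hat{\alpha}_k^2 + \hat{\beta}_k^2}, \qquad \hat{\phi}_k = \tan^{-1}(-\hat{\beta}_k / \hat{\alpha}_k)$$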
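
The conversion between the quadrature form (3.5) and the polar form (3.6) can be checked numerically. The sketch below assumes the sign convention $\beta_k = -c_k \sin(\phi_k)$, which is the one that makes $\alpha_k \cos(\omega_k n) + \beta_k \sin(\omega_k n)$ equal to $c_k \cos(\omega_k n + \phi_k)$; the function names and numeric values are hypothetical.

```python
import math


def quadrature_to_polar(alpha, beta):
    """Convert quadrature coefficients to amplitude c and phase phi."""
    c = math.hypot(alpha, beta)
    phi = math.atan2(-beta, alpha)  # four-quadrant form of tan^-1(-beta/alpha)
    return c, phi


def polar_to_quadrature(c, phi):
    """Inverse mapping: alpha = c cos(phi), beta = -c sin(phi)."""
    return c * math.cos(phi), -c * math.sin(phi)


# Illustrative check at one frequency and one sample index.
alpha, beta = 0.6, -0.8
c, phi = quadrature_to_polar(alpha, beta)
omega, n = 0.3, 7
quadrature_value = alpha * math.cos(omega * n) + beta * math.sin(omega * n)
polar_value = c * math.cos(omega * n + phi)
alpha_back, beta_back = polar_to_quadrature(c, phi)
```

Using `atan2` rather than a plain arctangent keeps the phase in the correct quadrant when $\alpha_k$ is negative or zero.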

For speaker separation, Equation (3.6) can be partitioned since we assume the partitioning of the
frequency vector $\boldsymbol{\omega}$ is known a priori

$$\hat{x}(n) = \sum_{k=1}^{M_a} \hat{c}_{a,k} \cos(\omega_{a,k} n + \hat{\phi}_{a,k}) + \sum_{k=1}^{M_b} \hat{c}_{b,k} \cos(\omega_{b,k} n + \hat{\phi}_{b,k}) \tag{3.7}$$

and thus solution to the matrix equation (3.2) yields the sine-wave amplitudes and phases of the
two underlying speech components.

Figure  3-2  gives  an  example  of  the  STFTM  of  two  summed  frames  of  vocalic  speech  and 
Figure  3-3  shows  the  corresponding  H  matrix.  Although  the  H  matrix  has  values  that  occur  off 
of  the  main  diagonal,  these  values  fall  off  rapidly  as  the  distance  from  the  main  diagonal 
increases.  This  property  reflects  the  condition  that  overlap  among  the  main  lobes  of  scaled  and 
shifted  versions  of  the  Fourier  transform  of  the  window  occurs  primarily  between  neighboring 
lobes  of  different  speakers  (the  analysis  window  is  assumed  long  enough  so  that  main  lobes  of  a 
single  speaker  do  not  overlap).  Occasionally,  however,  the  H  matrix  will  have  a  broader  diagonal 
arising  when  the  speakers  are  low  in  pitch  and  the  window  lengths  are  short  in  duration. 
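
The banded structure of H can be reproduced with a short sketch that evaluates the window transform at every pairwise frequency difference. The Hamming window, the 201-sample length, and the four frequencies are assumptions made for illustration; the report does not specify them here. The 10 kHz sampling rate is the one quoted in Section 6.1.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window centred on n = 0 (length must be odd)."""
    half = (length - 1) // 2
    return [0.54 + 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(-half, half + 1)]


def window_transform(omega, window):
    """W(omega) for a symmetric real window, normalised so that W(0) = 1."""
    half = (len(window) - 1) // 2
    total = sum(window)
    return sum(w * math.cos(omega * (i - half))
               for i, w in enumerate(window)) / total


def build_h(freqs_rad, window):
    """H[i][j] = W(omega_i - omega_j) for the sorted frequency vector."""
    return [[window_transform(wi - wj, window) for wj in freqs_rad]
            for wi in freqs_rad]


SAMPLE_RATE_HZ = 10000.0                     # sampling rate quoted in Section 6.1
freqs_hz = [100.0, 151.0, 200.0, 302.0]      # hypothetical interleaved frequencies
freqs_rad = [2 * math.pi * f / SAMPLE_RATE_HZ for f in freqs_hz]
H = build_h(freqs_rad, hamming_window(201))  # assumed 201-sample analysis window
```

In this example the entry coupling the neighbours at 100 Hz and 151 Hz is much larger than the entry coupling 100 Hz and 302 Hz, which lies outside the main lobe of the window transform.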

Figure 3-2. Demonstration of two-lobe overlap: (a) STFTM of $x_a(n)$, (b) STFTM of $x_b(n)$,
(c) STFTM of $x(n) = x_a(n) + x_b(n)$, and (d) sine-wave frequencies.

Figure 3-3. H matrix for the example in Figure 3-2.

The preceding analysis views the problem of solving for the sine-wave amplitudes and phases
in the frequency domain. Alternatively, the problem can be viewed in the time domain. In
Appendix B, it is shown that, for suitable window lengths, the vectors $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ that satisfy (3.2)
also approximate the vectors that minimize the weighted mean square distance between the
measured speech frame, s(n), and the steady state sinusoidal model, x(n), for summed vocalic speech
with the sinusoidal frequency vector $\boldsymbol{\omega}$ and unknown parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. Specifically, the
following minimization is performed with respect to $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$

$$\min \sum_{n=-(N-1)/2}^{(N-1)/2} w(n)\,[x(n) - s(n)]^2 \tag{3.8}$$

The error weighting in the least-squared error (LSE) problem (3.8) is the analysis window that is
used to obtain the STFT. The solution to x(n) in (3.8) is given by $\hat{x}(n)$ in (3.5). Thus the matrix
equation in (3.2) can be arrived at by two apparently different approaches: in the frequency
domain, by investigating the linear dependence of the STFT on scaled and shifted versions of the
Fourier transform of the analysis window, or, in the time domain, by the waveform minimization
given in (3.8). These two interpretations have analogies in the one-speaker case where
least-squares minimization in the time domain leads to a solution which chooses sine-wave amplitudes
and phases at peaks in the STFT.

3.2  Implementation 

The frequency estimates used in the solution (3.2) were obtained by peak-picking the
STFTM of each separate waveform. A 4096-point FFT was found to give sufficient frequency
resolution for adequate separation. The Gauss-Seidel iterative method was then used in solving
(3.2). The algorithm is computationally inexpensive, easy to program, and exhibited stability and
rapid convergence for the cases tested. Convergence of this algorithm is guaranteed for positive
definite matrices, a property of the matrices in our least-squares problem.

The vector obtained by sampling the STFT at the sine-wave frequencies, i.e., the solution from
the frequency-sampling method, was used as an initial guess in the iterative algorithm. The
iterative approach to solving (3.2) may be looked upon, therefore, as improving the initial
amplitude and phase estimates obtained by sampling the summed STFT at the frequencies of each
speaker.
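
A minimal sketch of the Gauss-Seidel iteration for a system of the form (3.2) follows. The matrix and right-hand side are small hypothetical values chosen so that the matrix is symmetric and positive definite with a unit diagonal, as H is; the initial guess is the right-hand side itself, standing in for the frequency-sampling solution.

```python
def gauss_seidel(h, rhs, x0, iterations=50):
    """Iteratively solve h x = rhs, starting from the initial guess x0."""
    x = list(x0)
    n = len(rhs)
    for _ in range(iterations):
        for i in range(n):
            # Use the most recent values of the other unknowns.
            off_diagonal = sum(h[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (rhs[i] - off_diagonal) / h[i][i]
    return x


# Hypothetical 3x3 example with a unit diagonal and decaying off-diagonal terms.
h = [[1.0, 0.4, 0.0],
     [0.4, 1.0, 0.3],
     [0.0, 0.3, 1.0]]
x_true = [1.0, -0.5, 0.25]
rhs = [sum(h[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x_est = gauss_seidel(h, rhs, x0=rhs)
```

The same routine works unchanged for complex right-hand sides, so the real and imaginary parts need not be handled by separate calls.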

3.3 An Example

An example is given in Figure 3-4 which shows the LSE solution of the two signals making
up the summed speech, where speech waveform A is about 3 dB below speech waveform B. Both
the waveform and STFTM estimates are compared with their respective originals. The figure
gives the outcome of 50 iterations of the Gauss-Seidel algorithm, although in practice fewer
iterations are required.

3.4  Sensitivity 

As  frequencies  of  speaker  A  come  arbitrarily  close  to  those  of  speaker  B,  the  conditioning  of 
the  H  matrix  deteriorates  to  where  the  matrix  becomes  singular  (see  Appendix  C).  For  these 
cases,  solving  the  LSE  problem  does  not  permit  separation.  In  detecting  these  cases,  the  spacing 
between neighboring frequencies is monitored. A single sinusoid is used to represent two
sinusoids whose frequencies are closely spaced, e.g., less than 25 Hz apart. Closely-spaced frequencies
which  satisfy  this  criterion  are  then  combined  as  single  entries  in  the  LSE  Equations  (3.2)  and 
(3.3). 

Figures 3-5(a) and 3-5(b) illustrate such an example where speaker B is 20 dB below speaker
A. One lobe is missing in the reconstructed STFTM of each speaker. The monitoring procedure
detected the presence of two frequencies which are close enough to cause ill-conditioning of the
H matrix. These frequencies, merged as one in the LSE solution, were not used in the
reconstruction. A strategy for resolving these ambiguities is proposed in Section 5.2.

Figure 3-4. Separation of summed speech waveforms: (a) speaker A (upper) compared
to estimate of speaker A (lower) and (b) speaker B (upper) compared to estimate
of speaker B (lower).


[Plot axes: TIME (ms), FREQUENCY (Hz)]

Figure 3-5. Demonstration of ill-conditioning of the H matrix: (a) speaker A (upper)
compared to estimate of speaker A (lower) and (b) speaker B (upper) compared to
estimate of speaker B (lower).

4. TRACKING THE FUNDAMENTAL FREQUENCY PAIR


Simultaneous estimation of sine-wave amplitudes, phases, and frequencies is a difficult
nonlinear problem; the assumption of a priori frequency knowledge, relied on in the previous
section, helped to make the problem linear. In this section, we reduce the dimensionality of the
nonlinear problem by assuming frequencies for each speaker can be represented by a multiple of a
fundamental frequency, $\omega_a$ for speaker A and $\omega_b$ for speaker B.

Under this harmonic assumption, since the model (2.3) and (2.4) is a nonlinear function of
the fundamental frequencies, a simple closed-form solution, based on the least-squares approach,
does not exist. Under certain conditions, however, the two fundamental frequencies can be
tracked in time by using estimates on each analysis frame as initial estimates in a refinement
procedure for the next frame. In particular, if the analysis frames are closely spaced, then pitch
changes slowly across two consecutive frames k and k + 1. The pitch estimate obtained on frame k
can then be used as the initial guess for estimating the pitch on frame (k + 1). A grid search is
proposed as a means by which the tracking procedure can be initialized. The iterative method of
steepest descent is then used for updating the pitch estimate on each frame. Figure 4-1
summarizes our approach to estimating pitch.

4.1  The  Pitch  Update  Procedure 

On each analysis frame, the method of steepest descent updates an initial pitch pair estimate
by adding to the estimate a scaled error gradient with respect to the unknown pitch pair.
Specifically, the pitch pair estimate on the kth frame is updated as

$$(\hat{\omega}_a, \hat{\omega}_b)_{k+1} = (\hat{\omega}_a, \hat{\omega}_b)_k - \alpha\,\Delta\epsilon \tag{4.1}$$

where $\Delta\epsilon$ is the differential error. The error signal for the update (4.1) is the weighted least-mean
square difference between the reconstructed waveform model estimate and the measured summed
speech waveform, i.e., Equation (3.8), which is repeated here for convenience

$$\epsilon(\omega_a, \omega_b) = \min \sum_{n=-(N-1)/2}^{(N-1)/2} w(n)\,[x(n) - s(n)]^2 \tag{4.2}$$

The solution to x(n) in (4.2), $\hat{x}(n)$, has the form of (3.7), but with distinct frequencies replaced by
multiples of a fundamental

$$\hat{x}(n) = \sum_{k=1}^{M_a} \hat{c}_{a,k} \cos(\omega_a k n + \hat{\phi}_{a,k}) + \sum_{k=1}^{M_b} \hat{c}_{b,k} \cos(\omega_b k n + \hat{\phi}_{b,k}) \tag{4.3}$$

and where, for a given pitch pair, the minimization in (4.2) takes place with respect to the
unknown sine-wave amplitudes and phases. For a given pitch pair, the reconstructed waveform is
obtained, therefore, by using the amplitudes and phases that result from the solution to the LSE
problem, and thus the error surface over which we are minimizing is itself a minimum for each
pitch pair.

Figure 4-1. An algorithm for tracking the fundamental frequencies.

Figure 4-2 illustrates this iterative update procedure. The convergence factor, $\alpha$, governs both
the rate of convergence and the stability of the iteration. The gradient was approximated by the
first central difference taken on a finely sampled error surface. Samples of the error surface,
required by the central difference computation, were obtained by performing the LSE operation
to determine the minimizing amplitudes and phases for the various pitch pairs shown in
Figure 4-2. The iteration terminates once the error incurred on a step exceeds the error incurred
on the preceding step. After each termination, the iteration can be restarted with a reduced step
size $\Delta$ in order to obtain greater accuracy.
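
The update loop can be sketched as follows. In the report each error sample requires a full LSE solution for the amplitudes and phases; here a simple quadratic function stands in for that error surface so that the example is self-contained. The stand-in surface, the function names, and the values of `alpha` and `delta` are all assumptions for illustration.

```python
def central_gradient(err, wa, wb, delta):
    """First central difference of the error surface at the pitch pair (wa, wb)."""
    ga = (err(wa + delta, wb) - err(wa - delta, wb)) / (2.0 * delta)
    gb = (err(wa, wb + delta) - err(wa, wb - delta)) / (2.0 * delta)
    return ga, gb


def update_pitch_pair(err, wa, wb, alpha, delta, max_steps=100):
    """Steepest descent from (wa, wb); stops when a step no longer lowers the error."""
    best = err(wa, wb)
    for _ in range(max_steps):
        ga, gb = central_gradient(err, wa, wb, delta)
        next_a, next_b = wa - alpha * ga, wb - alpha * gb
        next_err = err(next_a, next_b)
        if next_err >= best:  # error did not decrease: terminate
            break
        wa, wb, best = next_a, next_b, next_err
    return wa, wb, best


def stand_in_error(wa, wb):
    """Hypothetical error surface with its minimum at (120 Hz, 180 Hz)."""
    return (wa - 120.0) ** 2 + 2.0 * (wb - 180.0) ** 2


# Previous-frame estimate used as the initial guess for the current frame.
pitch_a, pitch_b, final_err = update_pitch_pair(
    stand_in_error, 115.0, 186.0, alpha=0.1, delta=0.5)
```

After termination, the call can be repeated from the returned pair with a smaller `delta`, mirroring the restart with a reduced step size described above.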

4.2  Estimation  of  an  Initial  Pitch  Point 

In order to initiate the pitch estimation algorithm, we must choose a pitch pair on the
starting frame. Specifically, we sample the two-dimensional error surface in (4.2) and choose a pitch
pair that yields the minimum error. Figure 4-3 illustrates a simple one-dimensional representation
of the assumed error surface as a function of pitch. The function exhibits several extrema, but
a single global minimum. A grid search must be based on a sampled version of this function.
Figures 4-3(b) and 4-3(c) depict two different sampling intervals. In the first case, the sampling
procedure indicates the region of convexity where the global minimum occurs. In the second
instance, the search is too coarse, and the minimum value of the sequence corresponds to a
region of convexity that is attributed to a local, but not global, minimum. The severity of
multiple local minima is often tied to the high-frequency content in the signal. The tracking
procedure developed in the previous section will not tolerate this type of error. For this reason, a
conservative sampling interval was chosen.

A practical pitch range was given as a boundary to the grid search. For example, it might be
known that speaker A’s pitch is somewhere between 100 Hz and 150 Hz, and that speaker B’s
pitch is between 150 Hz and 220 Hz. This a priori information reduced the search range and
circumvented the problem of examining candidate fundamentals which were factors of the
underlying fundamentals.
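
A bounded grid search of this kind can be sketched as below. The pitch ranges are the example values from the text; the 5 Hz grid spacing and the stand-in error function are assumptions, since the report does not state the sampling interval and the true error requires an LSE solution at every grid point.

```python
def grid_search(err, range_a, range_b, step_hz):
    """Return the pitch pair on a rectangular grid that minimises err."""
    def grid(low, high):
        count = int((high - low) / step_hz) + 1
        return [low + i * step_hz for i in range(count)]

    candidates = [(fa, fb) for fa in grid(*range_a) for fb in grid(*range_b)]
    return min(candidates, key=lambda pair: err(*pair))


def stand_in_error(fa, fb):
    """Hypothetical error surface with its minimum at (120 Hz, 180 Hz)."""
    return (fa - 120) ** 2 + 2 * (fb - 180) ** 2


initial_pair = grid_search(stand_in_error, (100, 150), (150, 220), step_hz=5)
```

On a real error surface with several local minima, the spacing `step_hz` must be fine enough that the smallest sample falls in the convex region around the global minimum.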

4.3  An  Example 

For a limited data base, pitch contours were obtained by invoking the tracking procedure
outlined in Sections 4.1 and 4.2. Figure 4-4 illustrates two pitch contours estimated from summed
utterances of roughly equal intensities. The upper contour corresponds to the utterance, “We
were away in Walla Walla,” spoken by a female. The lower contour corresponds to the utterance,
“Why were you away a year Roy?,” spoken by a male. The solid line indicates the pitch that was
extracted prior to summation. The dotted line indicates the pitch that was extracted from the
summation by means of the pitch tracking algorithm. The two initial points (t = 0) on the
contours were extracted by means of the grid search described in Section 4.2.

Figure 4-2. Gradient search. $\epsilon(\omega_1, \omega_2)$ is the error surface sampled at the pitch pair $(\omega_1, \omega_2)$.

Figure 4-4. Two pitch contours extracted from summed vocalic waveforms.


4.4 Limitations


Although  pitch  extraction  was  successful  on  a  number  of  all-voiced  passages,  the  method 
suffers  from  the  following  disadvantages: 

(1) The error from the LSE processor is used in pitch estimation; thus the pitch
estimate is susceptible to matrix conditioning problems and lapses from
stationarity where the periodic model breaks down. The data base for which
pitch was extracted was limited to nonintersecting pitch contours. When pitch
contours cross, the harmonic frequencies align, and the conditioning of the
LSE problem deteriorates.

(2) The method of tracking the pitch contours from one frame to the next
depends on smoothness and continuity, thus precluding pauses and
consonants.

(3) The pitch algorithm was found useful for intensity differences between
speakers of up to roughly 3 dB. Larger intensity differences prevented pitch
tracking of the weaker speaker.

5. MULTI-FRAME INTERPOLATION


We  saw  in  Section  3.4  that  the  least-squares  solution  to  estimating  sine-wave  amplitudes  and 
phases  can  become  ill-conditioned  when  sine-wave  frequencies  of  the  two  underlying  waveforms 
become  arbitrarily  close.  In  this  section,  we  propose  a  synthesis  scheme  to  help  resolve  the  case 
where the ill-conditioning of the H matrix in (3.4) does not permit the solution to the LSE
problem. This strategy exploits the time evolution of the sine-wave amplitudes and phases.

5.1  Approach 

Assume the availability of either accurate frequency estimates or harmonic frequency
estimates in the form of pitch contours. The LSE solution (3.2) is then used to extract the
amplitudes and phases of a model with the given frequencies. Two conditions under which
ill-conditioning of the H matrix can occur are illustrated in Figure 5-1.

Figure 5-1. Failure of the least squares solution with closely-spaced frequencies:
(a) crossing frequency tracks and (b) crossing pitch contours.

In the first case, a phase and amplitude ambiguity occurs in solving (3.2) whenever isolated
frequencies belonging to one speaker align themselves with frequencies belonging to the second
speaker. In this case, the amplitude and phase separation cannot be resolved and, therefore,
“spectral holes” arise. Ill conditioning may even occur when the two pitch periods are markedly
dissimilar. Consider the case of the two fundamental frequencies 100 Hz and 151 Hz. The two
harmonic sets $S_a$ and $S_b$ can be generated.

Sa  =  (100,  200,  300,  400,  500,  600,  700)  (5.1a) 

Sb  =  (151,  302,  453,  604,  755,  906,  1057)  (5.1b) 

For  this  case,  numerical  instability  arises  at  the  overlapping  pairs  (300,302)  and  (600,604). 
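
The harmonic sets in (5.1) and the closely spaced pairs can be reproduced with a short sketch. The helper names are hypothetical, and the 25 Hz threshold is the "closeness" criterion quoted elsewhere in the report.

```python
def harmonics(f0_hz, count):
    """First `count` integer multiples of the fundamental f0_hz."""
    return [f0_hz * k for k in range(1, count + 1)]


def close_pairs(set_a, set_b, min_spacing_hz=25):
    """Cross-speaker frequency pairs closer than the spacing criterion."""
    return [(fa, fb) for fa in set_a for fb in set_b
            if abs(fa - fb) < min_spacing_hz]


s_a = harmonics(100, 7)   # (100, 200, ..., 700)
s_b = harmonics(151, 7)   # (151, 302, ..., 1057)
pairs = close_pairs(s_a, s_b)
```

With these inputs the only pairs flagged are (300, 302) and (600, 604); every other cross-speaker spacing is at least 47 Hz.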

In  the  second  case  of  Figure  5-1,  ill  conditioning  occurs  when  the  pitch  contours  cross.  In 
this  case  all  of  the  harmonics  align.  Therefore,  separation  is  ambiguous  at  all  frequencies  and 
none  of  the  sine-wave  amplitudes  and  phases  can  be  resolved.  An  entire  frame  of  data  is  then 
deleted  from  the  reconstruction. 

The sinusoidal reconstruction strategy outlined in Section 2.1 can be used to interpolate
component sine waves over regions of ill conditioning of the H matrix. If the kth frame contains
frequencies that are too close to resolve, the corresponding amplitudes and phases are
interpolated between frame (k - p) and frame (k + q), as in the reconstruction over a single frame (see
Figure 5-2). Specifically, the linear amplitude and cubic phase interpolation strategies of
Section 2.1 are used with the sine-wave amplitudes and phases, respectively, measured at the
endpoints of frames (k - p) and (k + q). The integers, p and q, are chosen so that the frames (k - p)
and (k + q) lie in regions where the amplitudes and phases can be resolved. This procedure is
referred to as multi-frame interpolation. With multi-frame interpolation, the resulting frame
interval is typically four frames or 20 ms, but can extend as long as 100 ms during stationary regions.
A 25 Hz frequency “closeness” criterion was used in deciding when to perform interpolation.

When  the  pitch  contours  intersect,  this  procedure  must  be  performed  for  every  frequency 
component.  When  only  a  subset  of  the  frequencies  converge,  the  procedure  is  performed  on  only 
those frequencies. The algorithm was programmed to handle these different forms of
interpolation at any one point in time. As illustrated in Figure 5-3, short or long interval interpolation, or
no interpolation at all, can be performed along each frequency track. A long interpolation
interval can occur in a steady-state region where two frequency tracks lie close to each other over a
long  period  of  time;  while  a  short  interval  of  interpolation  can  occur  in  rapidly  varying  regions 
where  two  frequency  tracks  cross  at  a  sharp  angle. 
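
The amplitude part of multi-frame interpolation along one frequency track can be sketched as follows. This is a simplified, frame-level illustration: it fills each unresolved frame by linear interpolation between the nearest resolved frames (k - p) and (k + q), and it omits the cubic phase interpolation of Section 2.1 entirely. The function name and the data are hypothetical.

```python
def interpolate_amplitude_track(amps, resolved):
    """Fill unresolved frames by linear interpolation between resolved neighbours.

    amps     : per-frame amplitude of one sine-wave track
    resolved : per-frame flags, False where the LSE solution was ill-conditioned
    """
    out = list(amps)
    n = len(amps)
    for k in range(n):
        if resolved[k]:
            continue
        left = k - 1                      # search back for frame (k - p)
        while left >= 0 and not resolved[left]:
            left -= 1
        right = k + 1                     # search forward for frame (k + q)
        while right < n and not resolved[right]:
            right += 1
        if left < 0 or right >= n:
            continue                      # no resolved frame on one side
        t = (k - left) / (right - left)
        out[k] = (1.0 - t) * amps[left] + t * amps[right]
    return out


# Hypothetical track: frames 2 and 3 could not be resolved.
track = [1.0, 2.0, 0.0, 0.0, 5.0]
flags = [True, True, False, False, True]
filled = interpolate_amplitude_track(track, flags)
```

Because the search is carried out independently per track, some tracks can be bridged over a long interval while others in the same frame need no interpolation at all.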

5.2  An  Example 

Figure 5-4(a) depicts a frame of vocalic speech (left) and the STFTM for that frame (right).
Figure 5-4(b) depicts the reconstruction that is missing the fundamental frequency due to
ill-conditioning of the H matrix. In Figure 5-4(c), the missing fundamental is resolved via
multi-frame interpolation.

6. EXPERIMENTAL RESULTS


This  section  summarizes  the  results  of  listening  tests  using  the  least  squared  error  approach 
to  sine-wave  based  speaker  separation  and  enhancement.  The  LSE  method  is  examined  under  the 
following  three  conditions:  (1)  a  priori  frequencies,  (2)  a  priori  pitch,  and  (3)  estimated  pitch.  The 
importance  of  multi-frame  interpolation  is  demonstrated. 

6.1  Experimental  Procedure 

Microphone speech was digitized and sampled at 10 kHz. The data base consisted of a
variety of vocalic sentences; each sentence was recorded from male and female speakers. The
average pitch of the speakers ranged between 100 and 200 Hz. Prior to summation, the sampled
utterances were scaled to the desired target-to-interferer ratio. The calculation of this ratio was
based on the long-time average of the signal energy.
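
Scaling to a prescribed target-to-interferer ratio can be sketched as below. The report does not say which utterance was rescaled; this example assumes only the target is scaled and that the ratio is the energy ratio over the whole utterance, expressed in dB. The function name and sample values are hypothetical.

```python
import math


def scale_to_ratio(target, interferer, ratio_db):
    """Scale the target so that 10*log10(E_target / E_interferer) = ratio_db."""
    target_energy = sum(x * x for x in target)
    interferer_energy = sum(x * x for x in interferer)
    gain = math.sqrt(interferer_energy / target_energy
                     * 10.0 ** (ratio_db / 10.0))
    return [gain * x for x in target]


# Hypothetical short sequences standing in for two utterances.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [2.0, 0.0, -2.0, 0.0]
scaled = scale_to_ratio(target, interferer, ratio_db=-16.0)
summed = [t + i for t, i in zip(scaled, interferer)]

achieved_db = 10.0 * math.log10(sum(x * x for x in scaled)
                                / sum(x * x for x in interferer))
```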

Four  distinct  sentences  and  three  different  speakers  were  used  for  the  listening  tests  (see 
Table  6-1).  Test  utterances  included  the  summed  speech  of  male  with  male  speakers,  female  with 
female  speakers,  and  female  with  male  speakers.  The  tests  were  conducted  with  ten  listeners  and 
consisted  of  two  different  forms.  In  one  test  (Test  A),  the  listeners  were  given  the  original 
summed  utterance  followed  by  two  enhanced  versions  (randomly  arranged)  of  the  target  speaker. 
The  listener  was  asked  to  choose  the  enhanced  version  he  preferred  in  terms  of  intelligibility.  In 
the  second  test  (Test  B),  the  listeners  compared  the  intelligibility  of  the  target  speaker  in  the 
summed  utterance  to  that  in  the  processed  version  of  the  utterance  and  judged  the  improvement 
in the processed version on a scale of 0 to 3 given by (0) no improvement, (1) slight
improvement, (2) definite improvement, and (3) significant improvement.


TABLE 6-1

Data Base Used in Listening Tests

A. "We were away in Walla Walla." (male 1)
B. "Our rule will lower your ear away." (female 1)
C. "We were away in Walla Walla." (male 2)
D. "All wear your ear low." (male 1)
E. "Wear your ear low." (female 1)
F. "All rare laws are well." (female 1)

The listening tests were performed within an acoustic sound chamber. The speech was played
to the listener through a set of headphones. The listener was seated at a computer terminal
which served as a means of cataloging his evaluation and also served as an interface to the sound
system. The listener could monitor the summed utterance and the processed versions of the
utterance at keyboard request, and as often as desired. The listener was prompted for a rating the
moment he indicated, through the keyboard, that he was ready to evaluate the utterance. Once
an utterance was evaluated, the listener could not return to that utterance.

Three of the four sentences were nonsensical to prevent the listener from inferring the
passage through context. The recurrence of the same sentences, however, allowed the listeners to
learn the content of the passages as the test proceeded. Thus, ratings may not have been
consistent throughout a single test.

6.2  Processing  Schemes 

The  LSE  processing  scheme  with  a  priori  frequencies  and  a  priori  pitch,  the  LSE  processing 
scheme with estimated pitch, and the frequency sampling scheme were evaluated. The LSE
processing schemes were performed both with and without multi-frame interpolation.

In experiments with a priori frequencies, the LSE processor was provided with frequencies of
all the major peaks of each speaker’s STFTM prior to summation. The LSE processor extracted
the corresponding amplitudes and phases of each utterance over successive frames. The
frequencies, amplitudes, and phases associated with the target speaker were then given to the sinusoidal
reconstruction system. In experiments with a priori pitch, the pitch contours were extracted by
standard pitch estimation and by means of hand editing. Integer multiples of the two
fundamentals were used to parameterize the two frequency sets. The frequency sampling procedure
described in Section 3.2 was invoked to determine whether the LSE approach yielded any
improvement over a more direct strategy that made use of a priori frequencies, and also to provide a
reference for all other processing schemes. In this method the STFT of the summed waveform
was sampled at the STFTM peaks of the target speaker to obtain amplitude and phase samples
that were given to the sine-wave synthesizer.

6.3 Listening Tests

The effect of processing was to significantly enhance the intelligibility of the target speaker,
while reducing the interfering speech to “noise” or sometimes highly garbled cross talk. The
remnant interfering speech was generally of much lower intensity than that perceived in the original
sum, and in some cases, was not perceived as speech. Good signal recovery and intelligibility
improvements were observed over a range of target-to-interferer ratios of 9 to -16 dB, although
listening tests, described below, were performed at -16 dB.

Multi-Frame Interpolation


To  determine  the  importance  of  multi-frame  interpolation,  summed  utterances  were  formed 
with  the  target  speaker  16  dB  below  the  interferer.  The  test  examined  the  effect  of  processing 
with  a  priori  frequencies  and  a  priori  pitch,  with  and  without  multi-frame  interpolation. 

Table  6-2  illustrates  a  strong  preference  for  speech  processed  with  multi-frame  interpolation.  Test 
A, in particular, showed that 81% of the listeners preferred multi-frame interpolation.
Consequently, all following experiments were performed with the use of multi-frame interpolation.


TABLE 6-2

Tests Comparing Synthesis with Multi-Frame Interpolation (MFI)
and No Multi-Frame Interpolation (NMFI)

TEST    MFI     NMFI    NO PREFERENCE
A       81%     6%      13%
B       2.12    1.6     -

Frequency  Sampling 

As  a  reference,  speech  was  synthesized  using  sine-wave  parameters  derived  from  the 
frequency-sampling  method.  With  the  target  speaker  set  16  dB  below  the  interfering  speaker,  the 
remnant of the interfering speaker was speech-like and of modest intensity. Only a minimal
degree of enhancement was obtained; an average rating level of 0.6 was achieved using Test B.

A  Priori  Frequencies  and  Pitch 

In  the  next  set  of  experiments,  the  result  of  LSE  processing  with  a  priori  frequencies  was 
compared  with  processing  with  a  priori  pitch,  again  with  a  16  dB  intensity  difference.  Test  B 
shows  significant  intelligibility  gains  in  both  cases.  As  illustrated  in  Table  6-3,  the  pitch-based 
system,  however,  was  not  capable  of  attaining  the  same  level  of  enhancement  as  the  frequency- 
based system. Additional cross talk was present due to inaccuracy in the pitch estimate and
limitation to a harmonic frequency set. Nevertheless, significant intelligibility improvements were
attained. 

TABLE 6-3

Results of Listening Tests Comparing Synthesis
with A Priori Frequencies and A Priori Pitch

TEST    FREQUENCY    PITCH    NO PREFERENCE
A       72%          0%       28%
B       2.6          1.64     -

Estimated Pitch

Since the pitch extraction algorithm has been successfully demonstrated at only roughly
equal intensity levels, a reference was first formed by processing with a priori frequencies and
a priori pitch at equal intensity settings. The LSE processor achieved significant suppression of
the interferer with both a priori frequency and pitch. Since this case does not occur often in
practice, extensive listening tests were not performed. The same set of utterances was then
processed where a priori information consisted of only an initial point on the two dimensional pitch
contour. This initial point was obtained through the use of the grid search of Section 4.2. A pair
of pitch contours was then generated from the summed speech waveform by the method
described in Section 4.1. Significant suppression was obtained in the reconstruction of the desired
speaker; although, in contrast to the case where a priori pitch was assumed, some quality loss
was apparent.

The  system  was  capable  of  handling  only  utterances  having  pitch  contours  that  followed  the 
conditions  given  in  Section  4.  The  system  was  not  capable  of  resolving  situations  in  which  the 
two  pitch  contours  crossed.  The  experiments  were  performed  on  summed  male  and  female 
speech,  since,  in  such  cases,  the  crossing  of  pitch  contours  is  less  likely  to  occur. 

6.4  Assessment 

There are three principal reasons for the cross talk that remains after processing: modeling
errors, frequency errors, and unresolvable parameters.

(1) Modeling errors occur when the steady-state sine-wave model is not capable of
accurately representing an entire frame of speech, as when the vocal tract or
excitation changes too rapidly.

(2) Degradation arises when the model is not parameterized with the correct
frequencies. This problem was made apparent when the reconstruction with a priori
frequencies was compared with reconstruction with a priori pitch and estimated pitch.

(3) The multi-frame interpolation strategy helps to resolve cases with closely spaced
frequencies, but if the effective frame length becomes too large, degradation may
occur.

7. DISCUSSION


This  report  described  a  method  for  talker  interference  suppression  based  on  a  sinusoidal 
representation  of  speech.  A  least  squares  approach  for  obtaining  the  sine-wave  parameters  was 
proposed. When sine-wave frequencies of the underlying waveforms were closely spaced, a
multi-frame interpolation scheme was used to recover missing spectral regions which could not be
resolved  by  the  LSE  method. 

Although  success  of  the  LSE  relies  on  accurate  frequency  estimates,  results  of  this  report 
indicate  that  pitch  estimates  can  be  used  in  place  of  frequency  estimates.  Pitch  extraction  is  also 
necessary  as  a  means  of  partitioning  the  frequency  set  into  a  subset  that  is  attributed  to  speaker 
A  and  a  subset  that  is  attributed  to  speaker  B.  Thus,  the  further  development  of  a  pitch  estima¬ 
tion  algorithm,  capable  of  handling  summed  waveforms  of  vastly  different  intensity  levels,  is  criti¬ 
cal  to  sinusoidal  speech  separation  and  enhancement. 

The  results  of  this  report  lead  to  a  number  of  other  important  areas  of  continued  research. 
The experiments at the outset of this report using the peak-picking and frequency-sampling
methods  raise  some  fundamental  questions  about  the  methods  themselves,  about  the  limitations 
of  sine-wave  based  analysis-synthesis,  and  about  how  more  accurate  parameter  estimation  may  be 
achieved  in  this  context.  These  methods,  as  well  as  the  LSE  approach,  relied  on  local  characteris¬ 
tics  of  the  STFT,  i.e.,  separation  was  attempted  without  recourse  to  correlation  in  the  STFT 
across  frequency.  A  drawback  to  this  approach  was  manifested  in  solving  for  the  sine-wave 
parameters by the LSE method in the presence of closely spaced frequencies. One way of
accounting for spectral correlation might be to estimate amplitude and phase envelopes
of the STFT of each speech waveform, along with sine-wave frequencies or pitch. Such
estimation might use envelope models or template-based envelope matching. It is also of interest
to  develop  techniques  for  joint  estimation  of  sinusoidal  amplitudes,  frequencies,  and  phases  which 
include  multi-frame  continuity  constraints  and  interpolation  strategies. 

This  report  was  limited  to  utterances  for  which  a  vocalic  excitation  was  present.  There  are  a 
host  of  utterances  for  which  such  an  excitation  is  not  present.  Examples  include  fricatives,  stop 
consonants,  and  whispers.  Separation  of  such  combined  utterances  will  require  relaxing  the  har¬ 
monic  model,  using  more  complex  modeling  of  the  speech  waveform,  and  jointly  estimating  sinu¬ 
soidal  model  parameters  and  voicing  states  of  target  and  interfering  signals. 


APPENDIX  A 


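In this Appendix, we demonstrate the relation between the various sine-wave based representations
of $x(n)$ used in Sections 2 and 3. From a standard trigonometric identity, we can write
(2.2) as

\[
\begin{aligned}
x(n) &= \sum_{k=1}^{M} a_k \cos(\omega_k n + \phi_k) \\
     &= \sum_{k=1}^{M} a_k \cos(\phi_k) \cos(\omega_k n) - \sum_{k=1}^{M} a_k \sin(\phi_k) \sin(\omega_k n) \\
     &= \sum_{k=1}^{M} \alpha_k \cos(\omega_k n) + \sum_{k=1}^{M} \beta_k \sin(\omega_k n)
\end{aligned}
\tag{A1.a}
\]

where

\[
\alpha_k = a_k \cos(\phi_k), \qquad \beta_k = -a_k \sin(\phi_k)
\tag{A1.b}
\]

Alternatively, the sequence $x(n)$ can be written

\[
x(n) = \sum_{k=1}^{M} \frac{a_k}{2}\, e^{j\phi_k} e^{j\omega_k n}
     + \sum_{k=1}^{M} \frac{a_k}{2}\, e^{-j\phi_k} e^{-j\omega_k n}
\tag{A2}
\]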
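The equivalence of the three representations can be checked numerically. The following is a minimal sketch, not part of the original report; the amplitude, phase, and frequency values are arbitrary assumptions chosen only for illustration.

```python
import cmath
import math


def polar_form(a, phi, omega, n):
    """One term of (2.2): a * cos(omega * n + phi)."""
    return a * math.cos(omega * n + phi)


def quadrature_form(a, phi, omega, n):
    """One term of (A1.a), with alpha and beta defined as in (A1.b)."""
    alpha = a * math.cos(phi)
    beta = -a * math.sin(phi)
    return alpha * math.cos(omega * n) + beta * math.sin(omega * n)


def exponential_form(a, phi, omega, n):
    """One term of (A2): positive- plus negative-frequency exponentials."""
    pos = 0.5 * a * cmath.exp(1j * phi) * cmath.exp(1j * omega * n)
    neg = 0.5 * a * cmath.exp(-1j * phi) * cmath.exp(-1j * omega * n)
    return (pos + neg).real


# Hypothetical parameter values, used only to exercise the identity.
a, phi, omega = 1.7, 0.6, 0.35

max_diff = max(
    max(
        abs(polar_form(a, phi, omega, n) - quadrature_form(a, phi, omega, n)),
        abs(polar_form(a, phi, omega, n) - exponential_form(a, phi, omega, n)),
    )
    for n in range(-20, 21)
)
print(max_diff)
```

The printed difference should be at the level of floating-point rounding, since all three forms are exact rewrites of the same sinusoid.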


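Assuming negligible contribution from the negative frequency terms in (A2), the STFT of
$x(n)$, $S(\omega)$, for $\omega = \omega_k > 0$ can be written as

\[
\sum_{\ell=1}^{M} \frac{a_\ell}{2}\, e^{j\phi_\ell}\, W(\omega_k - \omega_\ell) = S(\omega_k)
\tag{A3.a}
\]

or

\[
\begin{aligned}
\sum_{\ell=1}^{M} \frac{a_\ell}{2} \cos(\phi_\ell)\, W(\omega_k - \omega_\ell) &= \mathrm{Re}\,[S(\omega_k)] \\
\sum_{\ell=1}^{M} \frac{a_\ell}{2} \sin(\phi_\ell)\, W(\omega_k - \omega_\ell) &= \mathrm{Im}\,[S(\omega_k)]
\end{aligned}
\tag{A3.b}
\]

and therefore,

\[
\begin{aligned}
\sum_{\ell=1}^{M} \alpha_\ell\, W(\omega_k - \omega_\ell) &= 2\,\mathrm{Re}\,[S(\omega_k)] \\
\sum_{\ell=1}^{M} \beta_\ell\, W(\omega_k - \omega_\ell) &= -2\,\mathrm{Im}\,[S(\omega_k)]
\end{aligned}
\tag{A3.c}
\]

where

\[
\alpha_\ell = a_\ell \cos(\phi_\ell), \qquad \beta_\ell = -a_\ell \sin(\phi_\ell)
\]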
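As an illustration of (A3.c), the sketch below samples the STFT of a synthetic three-sinusoid frame at the known frequencies and solves the resulting linear system for the quadrature coefficients. It is a hedged example, not the report's implementation: the Hamming window, the frame length, and all parameter values are assumptions, and the frequencies are deliberately kept well away from zero so that the neglected negative-frequency terms are small.

```python
import numpy as np

N = 255                                   # odd frame length (assumed)
half = (N - 1) // 2
n = np.arange(-half, half + 1)            # time index centered on the frame
w = np.hamming(N)                         # symmetric window, so W(omega) is real


def window_transform(omega):
    """W(omega) = sum_n w(n) cos(omega n), valid for a symmetric window."""
    return np.sum(w * np.cos(omega * n))


def stft_sample(x, omega):
    """S(omega) = sum_n w(n) x(n) exp(-j omega n)."""
    return np.sum(w * x * np.exp(-1j * omega * n))


# Hypothetical sinusoid parameters (radians per sample, linear amplitude).
freqs = np.array([0.30, 0.50, 0.90])
amps = np.array([1.0, 0.5, 0.8])
phases = np.array([0.2, -1.0, 2.0])
x = sum(a * np.cos(om * n + ph) for a, om, ph in zip(amps, freqs, phases))

# Matrix of window-transform samples W(omega_k - omega_l).
H = np.array([[window_transform(wk - wl) for wl in freqs] for wk in freqs])
S = np.array([stft_sample(x, wk) for wk in freqs])

# (A3.c): H alpha = 2 Re[S],  H beta = -2 Im[S].
alpha_est = np.linalg.solve(H, 2.0 * S.real)
beta_est = np.linalg.solve(H, -2.0 * S.imag)

alpha_true = amps * np.cos(phases)
beta_true = -amps * np.sin(phases)
print(np.abs(alpha_est - alpha_true).max(), np.abs(beta_est - beta_true).max())
```

The residual errors are not zero because (A3.a) drops the negative-frequency terms; they shrink as the window transform becomes narrower relative to the sinusoid frequencies.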

APPENDIX  B

THE  LEAST  SQUARED  ERROR  SOLUTION 


The steady state model for a frame of summed vocalic speech is expressed as a sum of
sinusoids.

\[
x(n) = \sum_{k=1}^{M} a_k \cos(\omega_k n + \phi_k)
\tag{B.1}
\]

Alternatively, (B.1) may be written in terms of quadrature components.

\[
x(n) = \sum_{k=1}^{M} \alpha_k \cos(\omega_k n) + \sum_{k=1}^{M} \beta_k \sin(\omega_k n)
\tag{B.2}
\]

\[
a_k = \sqrt{\alpha_k^2 + \beta_k^2}, \qquad \phi_k = \tan^{-1}(-\beta_k / \alpha_k)
\]

In this form, it is easy to see that $x(n)$ is a linear function of the unknown coefficients $\alpha_k$ and
$\beta_k$. Let $N$ be the length of the frame. Assume that $N$ is odd. This appendix will determine the
coefficients $\alpha_k$ and $\beta_k$ that yield the best fit of the model, (B.2), to a single frame of additive
vocalic speech, over the region $-(N-1)/2 \le n \le (N-1)/2$, when the frequencies $\omega_k$ are given.
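The conversion from quadrature coefficients back to amplitude and phase can be sketched as follows. This is an illustrative example with hypothetical values; it uses the four-quadrant `math.atan2` in place of a plain arctangent so that the phase is recovered over the full range $(-\pi, \pi]$, which is an implementation choice rather than something stated in the report.

```python
import math


def to_amplitude_phase(alpha, beta):
    """Map quadrature coefficients (alpha, beta) to (a, phi).

    Uses a = sqrt(alpha^2 + beta^2) and phi = atan2(-beta, alpha), the
    four-quadrant version of phi = arctan(-beta / alpha).
    """
    a = math.hypot(alpha, beta)
    phi = math.atan2(-beta, alpha)
    return a, phi


# Hypothetical amplitude and phase, chosen with the phase outside (-pi/2, pi/2)
# to show why the four-quadrant arctangent is used.
a_true, phi_true = 0.9, -2.4
alpha = a_true * math.cos(phi_true)
beta = -a_true * math.sin(phi_true)

a_est, phi_est = to_amplitude_phase(alpha, beta)
print(a_est, phi_est)
```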


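Before proceeding, we introduce some vector notation that will allow concise statement of
the linear least squares problem. Let the vectors $\mathbf{s}$ and $\mathbf{x}$ be the sampled waveform and the
sampled model. The set of sinusoidal frequencies will be denoted by the vector $\boldsymbol{\omega}$.

\[
\mathbf{s} =
\begin{bmatrix}
s[-(N-1)/2] \\ s[1-(N-1)/2] \\ s[2-(N-1)/2] \\ \vdots \\ s[n-(N-1)/2] \\ \vdots \\ s[(N-1)/2]
\end{bmatrix},
\qquad
\mathbf{x} =
\begin{bmatrix}
x[-(N-1)/2] \\ x[1-(N-1)/2] \\ x[2-(N-1)/2] \\ \vdots \\ x[n-(N-1)/2] \\ \vdots \\ x[(N-1)/2]
\end{bmatrix},
\qquad
\boldsymbol{\omega} =
\begin{bmatrix}
\omega_1 \\ \omega_2 \\ \omega_3 \\ \vdots \\ \omega_M
\end{bmatrix}
\tag{B.3}
\]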
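Let the vectors $\mathbf{r}(\omega_k)$ and $\mathbf{d}(\omega_k)$ denote the respective time sequences of cosine and sine functions.


The vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ will be the coefficient vectors for the cosine and sine terms in the model
(B.2). The matrices $\mathbf{R}(\boldsymbol{\omega})$ and $\mathbf{D}(\boldsymbol{\omega})$ will have $\mathbf{r}(\omega_k)$ and $\mathbf{d}(\omega_k)$ for their respective kth columns.

\[
\mathbf{R}(\boldsymbol{\omega}) = [\, \mathbf{r}(\omega_1) \mid \mathbf{r}(\omega_2) \mid \mathbf{r}(\omega_3) \mid \cdots \mid \mathbf{r}(\omega_M) \,]
\tag{B.5}
\]

\[
[\mathbf{R}(\boldsymbol{\omega})]_{n,k} = \cos[\omega_k (n - (N+1)/2)]
\tag{B.6}
\]

\[
\mathbf{D}(\boldsymbol{\omega}) = [\, \mathbf{d}(\omega_1) \mid \mathbf{d}(\omega_2) \mid \mathbf{d}(\omega_3) \mid \cdots \mid \mathbf{d}(\omega_M) \,]
\tag{B.7}
\]

\[
[\mathbf{D}(\boldsymbol{\omega})]_{n,k} = \sin[\omega_k (n - (N+1)/2)]
\tag{B.8}
\]

The notation $\mathbf{R}(\boldsymbol{\omega})$ suggests the dependency of the matrix on $\boldsymbol{\omega}$. This parenthetical notation will
be dropped for convenience, since the vector of frequencies is considered to be a fixed parameter
throughout.

The model (B.2) can now be concisely expressed in terms of the adopted vector notation.

\[
\mathbf{x} = \mathbf{R}\boldsymbol{\alpha} + \mathbf{D}\boldsymbol{\beta}
\tag{B.9}
\]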

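The general least squares problem can be stated as follows:

\[
\min_{\mathbf{v}} \; [\mathbf{x}(\mathbf{v}) - \mathbf{s}]^{T}\, \mathbf{W}\, [\mathbf{x}(\mathbf{v}) - \mathbf{s}]
\tag{B.10}
\]

where $\mathbf{x}(\mathbf{v})$ is the model parameterized by the vector $\mathbf{v}$ consisting of the vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. The
matrix $\mathbf{W}$ is a positive definite diagonal weighting matrix with diagonal terms equal to the
analysis window.

\[
\mathbf{W} =
\begin{bmatrix}
w[-(N-1)/2] & 0 & 0 & \cdots & 0 \\
0 & w[1-(N-1)/2] & 0 & \cdots & 0 \\
0 & 0 & w[2-(N-1)/2] & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & w[(N-1)/2]
\end{bmatrix}
\tag{B.11}
\]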

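Since the model is a linear function of the parameter vector, the model can be written as

\[
\mathbf{x} = \mathbf{F}\mathbf{v}
\tag{B.12}
\]

where $\mathbf{F}$ is given as a partitioned matrix.

\[
\mathbf{F} = [\, \mathbf{R} \mid \mathbf{D} \,]
\tag{B.13}
\]

The solution of the linear least squares problem (B.10) is well known, and given by the solution
to the following matrix equation.

\[
\mathbf{F}^{T}\mathbf{W}\mathbf{F}\,\mathbf{v} = \mathbf{F}^{T}\mathbf{W}\mathbf{s}
\tag{B.14}
\]

so that $\mathbf{v}$ can be written in terms of the inverse of $\mathbf{F}^{T}\mathbf{W}\mathbf{F}$. This inverse will exist if the columns
of $\mathbf{F}$ are linearly independent.

\[
\mathbf{v} = [\mathbf{F}^{T}\mathbf{W}\mathbf{F}]^{-1}\, \mathbf{F}^{T}\mathbf{W}\mathbf{s}
\tag{B.15}
\]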


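Because of the orthogonality between the sampled cosine and sine terms, and because the
weighting is symmetric about the origin, the partitioned matrix of inner products will have no
cross terms.

\[
\mathbf{F}^{T}\mathbf{W}\mathbf{F} =
\begin{bmatrix}
\mathbf{R}^{T}\mathbf{W}\mathbf{R} & \mathbf{0} \\
\mathbf{0} & \mathbf{D}^{T}\mathbf{W}\mathbf{D}
\end{bmatrix}
\tag{B.16}
\]

Thus, the LSE solution can be written in terms of two independent expressions. The first equation
will yield the coefficient vector $\boldsymbol{\alpha}$. The second equation can be solved for the coefficient vector
of the quadrature terms $\boldsymbol{\beta}$.

\[
\boldsymbol{\alpha} = [\mathbf{R}^{T}\mathbf{W}\mathbf{R}]^{-1}\, \mathbf{R}^{T}\mathbf{W}\mathbf{s}
\tag{B.17}
\]

\[
\boldsymbol{\beta} = [\mathbf{D}^{T}\mathbf{W}\mathbf{D}]^{-1}\, \mathbf{D}^{T}\mathbf{W}\mathbf{s}
\tag{B.18}
\]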
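The decoupling of the cosine and sine coefficients can be illustrated numerically. The sketch below is an assumption-laden example rather than the report's code: it uses a Hamming window as the diagonal weighting, hypothetical frequencies, and a noiseless synthetic frame that follows (B.9) exactly, then compares the joint solution (B.15) with the two independent solutions (B.17) and (B.18).

```python
import numpy as np

N = 201                                    # odd frame length (assumed)
half = (N - 1) // 2
n = np.arange(-half, half + 1)             # symmetric time index
w = np.hamming(N)                          # positive, symmetric weights
W = np.diag(w)

omega = np.array([0.40, 0.47, 1.10])       # hypothetical known frequencies
R = np.cos(np.outer(n, omega))             # kth column is r(omega_k)
D = np.sin(np.outer(n, omega))             # kth column is d(omega_k)

# Hypothetical true coefficients and a noiseless frame s = R alpha + D beta.
alpha_true = np.array([1.0, -0.3, 0.6])
beta_true = np.array([0.2, 0.8, -0.5])
s = R @ alpha_true + D @ beta_true

# Joint weighted least squares solution, (B.15), with F = [R | D].
F = np.hstack([R, D])
v = np.linalg.solve(F.T @ W @ F, F.T @ W @ s)

# Cross terms of (B.16): these vanish because cos is even, sin is odd,
# and the window is symmetric about the origin.
cross = R.T @ W @ D

# Independent solutions, (B.17) and (B.18).
alpha = np.linalg.solve(R.T @ W @ R, R.T @ W @ s)
beta = np.linalg.solve(D.T @ W @ D, D.T @ W @ s)

print(np.abs(cross).max())
print(alpha, beta)
```

Solving two M-by-M systems instead of one 2M-by-2M system gives the same coefficients here, which is the practical benefit of the block-diagonal structure.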

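In order to interpret this result in the context of Section 3, it will be helpful to rewrite the
short time Fourier transform (STFT) in terms of its real and imaginary parts.

\[
S_{re}(\omega) = \sum_{n=-(N-1)/2}^{(N-1)/2} s(n)\, w(n) \cos(\omega n)
\tag{B.19}
\]

\[
S_{im}(\omega) = -\sum_{n=-(N-1)/2}^{(N-1)/2} s(n)\, w(n) \sin(\omega n)
\tag{B.20}
\]

This can be expressed in terms of the adopted vector notation.

\[
S_{re}(\omega) = \mathbf{r}(\omega)^{T}\mathbf{W}\mathbf{s}
\tag{B.21}
\]

\[
S_{im}(\omega) = -\mathbf{d}(\omega)^{T}\mathbf{W}\mathbf{s}
\tag{B.22}
\]

The real and imaginary parts of the STFT sampled at the frequencies $\omega_1, \omega_2, \omega_3, \ldots, \omega_M$ can
be expressed as a vector function of a vector variable.

\[
\mathbf{S}_{re}(\boldsymbol{\omega}) =
\begin{bmatrix}
S_{re}(\omega_1) \\ S_{re}(\omega_2) \\ S_{re}(\omega_3) \\ \vdots \\ S_{re}(\omega_M)
\end{bmatrix}
= \mathbf{R}^{T}\mathbf{W}\mathbf{s}
\tag{B.23}
\]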


and likewise for the imaginary component. From Equation (B.17) and Equation (B.23), it should
be clear that the vector $\boldsymbol{\alpha}$ is obtained by pre-multiplying a vector which contains frequency
samples of the real part of the STFT by the matrix $[\mathbf{R}^{T}\mathbf{W}\mathbf{R}]^{-1}$. The parameter vector $\boldsymbol{\beta}$ is similarly
obtained by pre-multiplying a vector which contains frequency samples of the imaginary
part of the STFT by the matrix $-[\mathbf{D}^{T}\mathbf{W}\mathbf{D}]^{-1}$.
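The link between the LSE solution and STFT samples can be checked with a short sketch. The window, frame length, frequencies, and the random test frame below are all assumptions made for illustration; the identities being checked hold for any frame, which is why an arbitrary random vector is used.

```python
import numpy as np

N = 201                                    # odd frame length (assumed)
half = (N - 1) // 2
n = np.arange(-half, half + 1)
w = np.hamming(N)
W = np.diag(w)
omega = np.array([0.40, 0.47, 1.10])       # hypothetical known frequencies

rng = np.random.default_rng(0)
s = rng.standard_normal(N)                 # arbitrary frame

# STFT samples S(omega_k) = sum_n s(n) w(n) exp(-j omega_k n).
S = np.array([np.sum(s * w * np.exp(-1j * wk * n)) for wk in omega])

R = np.cos(np.outer(n, omega))
D = np.sin(np.outer(n, omega))

# Direct solutions, (B.17) and (B.18).
alpha_direct = np.linalg.solve(R.T @ W @ R, R.T @ W @ s)
beta_direct = np.linalg.solve(D.T @ W @ D, D.T @ W @ s)

# Same solutions written in terms of STFT samples: R^T W s = S_re and
# D^T W s = -S_im, hence the minus sign in front of the second inverse.
alpha_stft = np.linalg.solve(R.T @ W @ R, S.real)
beta_stft = -np.linalg.solve(D.T @ W @ D, S.imag)

print(np.abs(alpha_direct - alpha_stft).max())
print(np.abs(beta_direct - beta_stft).max())
```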

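It is left to show that

\[
\mathbf{R}^{T}\mathbf{W}\mathbf{R} \approx \mathbf{D}^{T}\mathbf{W}\mathbf{D} \approx \frac{1}{2}\mathbf{H}
\tag{B.24}
\]

for the window lengths, frequency spacing, and window transforms considered in Section 3. In
this expression,

\[
\mathbf{H} =
\begin{bmatrix}
W(0) & W(\omega_1 - \omega_2) & W(\omega_1 - \omega_3) & \cdots & W(\omega_1 - \omega_M) \\
W(\omega_2 - \omega_1) & W(0) & W(\omega_2 - \omega_3) & \cdots & W(\omega_2 - \omega_M) \\
W(\omega_3 - \omega_1) & W(\omega_3 - \omega_2) & W(0) & \cdots & W(\omega_3 - \omega_M) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
W(\omega_M - \omega_1) & W(\omega_M - \omega_2) & W(\omega_M - \omega_3) & \cdots & W(0)
\end{bmatrix}
\tag{B.25}
\]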


Now, since $\mathbf{W}$ is diagonal,

\[
[\mathbf{W}\mathbf{R}]_{n,k} = w[n - (N+1)/2] \cos[\omega_k (n - (N+1)/2)]
\tag{B.26}
\]

and

\[
[\mathbf{R}^{T}\mathbf{W}\mathbf{R}]_{i,k} = \sum_{n=-(N-1)/2}^{(N-1)/2} \cos(\omega_i n)\, w(n) \cos(\omega_k n)
\tag{B.27}
\]

It is apparent that (B.27) is the cosine transform of the windowed cosine sequence,
$w(n) \cos(\omega_k n)$. Therefore, it must equal the real part of the Fourier transform of that sequence.
Using the modulation property of the Fourier transform, along with the fact that the window
transform is purely real,

\[
[\mathbf{R}^{T}\mathbf{W}\mathbf{R}]_{i,k} = \frac{1}{2} W(\omega_i - \omega_k) + \frac{1}{2} W(\omega_i + \omega_k)
\tag{B.28}
\]

Equation (B.28) gives the entries in the matrix for the exact solution to the LSE problem. The
approximation used in Section 3 neglected the term $\frac{1}{2} W(\omega_i + \omega_k)$; that is,

\[
[\mathbf{R}^{T}\mathbf{W}\mathbf{R}]_{i,k} \approx \frac{1}{2} W(\omega_i - \omega_k)
\tag{B.29}
\]

This  approximation  is  valid  when  the  window  length  is  sufficient  (i.e.,  the  bandwidth  of  the 
window  transform  is  sufficiently  narrow).  The  approximation  was  primarily  introduced  for 
reasons  of  style.  It  allows  for  a  less  cluttered,  more  intuitive  argument  in  Section  3.  The  exact 
matrix  entries  (B.28)  were  used  in  all  algorithm  simulations. 
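The exact entries (B.28) and the size of the term dropped in (B.29) can be examined with the following sketch. The Hamming window, frame length, and frequencies are assumptions chosen for illustration only; other windows or closer-to-zero frequencies would change the size of the neglected term.

```python
import numpy as np

N = 201                                    # odd frame length (assumed)
half = (N - 1) // 2
n = np.arange(-half, half + 1)
w = np.hamming(N)                          # symmetric window, real transform
omega = np.array([0.40, 0.47, 1.10])       # hypothetical frequencies


def window_transform(x):
    """W(x) = sum_n w(n) cos(x n); real because w(n) is symmetric."""
    return np.sum(w * np.cos(x * n))


W_of = np.vectorize(window_transform)

R = np.cos(np.outer(n, omega))
G = R.T @ (w[:, None] * R)                 # R^T W R computed directly

diff = omega[:, None] - omega[None, :]     # omega_i - omega_k
summ = omega[:, None] + omega[None, :]     # omega_i + omega_k

exact = 0.5 * W_of(diff) + 0.5 * W_of(summ)   # entries from (B.28)
approx = 0.5 * W_of(diff)                     # entries from (B.29)

# Size of the neglected term relative to the largest matrix entry.
rel_err = np.abs(G - approx).max() / np.abs(G).max()
print(rel_err)
```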


APPENDIX  C 

A  SENSITIVITY  CALCULATION 


In this appendix, it is shown that ill-conditioning of the least squares error solution increases
as the spacing between the sine-wave frequencies decreases. Consider the solution to separation of
two overlapping lobes as described by Figure 3-1 with $\omega_1 = \omega$ and $\omega_2 = \omega + \Delta$.

\[
\begin{bmatrix} 1 & W(\Delta) \\ W(\Delta) & 1 \end{bmatrix}
\begin{bmatrix} S_a(\omega) \\ S_b(\omega + \Delta) \end{bmatrix}
=
\begin{bmatrix} S(\omega) \\ S(\omega + \Delta) \end{bmatrix}
\tag{C.1}
\]

The solution to this matrix equation is obtained by inverting the matrix that appears on the left
of (C.1). Let this matrix be denoted by $\mathbf{H}(\Delta)$ to make explicit the dependence of the matrix upon
the frequency difference. The solution to (C.1) is given by

\[
\begin{bmatrix} S_a(\omega) \\ S_b(\omega + \Delta) \end{bmatrix}
= \mathbf{H}^{-1}(\Delta)
\begin{bmatrix} S(\omega) \\ S(\omega + \Delta) \end{bmatrix}
\tag{C.2}
\]

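The partial derivative of the right side of this matrix equation, with respect to $\Delta$, can be
expressed as follows:

\[
\frac{\partial}{\partial \Delta}
\begin{bmatrix} S_a(\omega) \\ S_b(\omega + \Delta) \end{bmatrix}
= -\mathbf{H}^{-1}(\Delta) \left[ \frac{\partial \mathbf{H}(\Delta)}{\partial \Delta} \right] \mathbf{H}^{-1}(\Delta)
\begin{bmatrix} S(\omega) \\ S(\omega + \Delta) \end{bmatrix}
+ \mathbf{H}^{-1}(\Delta)
\begin{bmatrix} 0 \\ \dfrac{\partial S(\omega + \Delta)}{\partial \Delta} \end{bmatrix}
\tag{C.3}
\]

which can be expanded as

\[
\begin{aligned}
\frac{\partial}{\partial \Delta}
\begin{bmatrix} S_a(\omega) \\ S_b(\omega + \Delta) \end{bmatrix}
={}& \frac{-1}{|\det \mathbf{H}(\Delta)|^{2}}
\begin{bmatrix} 1 & -W(\Delta) \\ -W(\Delta) & 1 \end{bmatrix}
\frac{\partial W(\Delta)}{\partial \Delta}
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} 1 & -W(\Delta) \\ -W(\Delta) & 1 \end{bmatrix}
\begin{bmatrix} S(\omega) \\ S(\omega + \Delta) \end{bmatrix} \\
&+ \frac{1}{|\det \mathbf{H}(\Delta)|}
\begin{bmatrix} 1 & -W(\Delta) \\ -W(\Delta) & 1 \end{bmatrix}
\begin{bmatrix} 0 \\ \dfrac{\partial S(\omega + \Delta)}{\partial \Delta} \end{bmatrix}
\end{aligned}
\tag{C.4}
\]

where

\[
\det \mathbf{H}(\Delta) = 1 - W^{2}(\Delta).
\]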

When $\Delta \to 0$, $\det \mathbf{H}(\Delta) \to 0$. Therefore, as the frequency spacing tends to zero, the
data-independent factor in the second term in Equation (C.4) tends to infinity. For small $\Delta$, the
matrix equation is poorly conditioned; the solution is sensitive to small perturbations in $\Delta$.
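The growth in sensitivity can be illustrated by evaluating the determinant and condition number of $\mathbf{H}(\Delta)$ for shrinking $\Delta$. This is a minimal sketch under stated assumptions: a Hamming window of hypothetical length 255, a window transform normalized so that $W(0) = 1$ (as the unit diagonal of $\mathbf{H}$ implies), and spacings chosen inside the main lobe of that window.

```python
import math

N = 255                                    # odd frame length (assumed)
half = (N - 1) // 2
index = range(-half, half + 1)

# Hamming window centered on n = 0 (an assumed choice of analysis window).
window = [0.54 + 0.46 * math.cos(2.0 * math.pi * k / (N - 1)) for k in index]
total = sum(window)


def W(delta):
    """Window transform at frequency offset delta, normalized so W(0) = 1."""
    return sum(wn * math.cos(delta * k) for wn, k in zip(window, index)) / total


def det_H(delta):
    """det H(delta) = 1 - W(delta)^2 for H = [[1, W], [W, 1]]."""
    return 1.0 - W(delta) ** 2


def cond_H(delta):
    """2-norm condition number of H; its eigenvalues are 1 + W and 1 - W."""
    c = abs(W(delta))
    return (1.0 + c) / (1.0 - c)


# Hypothetical frequency spacings (radians per sample), largest first.
deltas = [0.04, 0.03, 0.02, 0.01, 0.005]
dets = [det_H(d) for d in deltas]
conds = [cond_H(d) for d in deltas]

for d, dt, c in zip(deltas, dets, conds):
    print(f"delta={d:.3f}  det={dt:.4f}  cond={c:.1f}")
```

As the spacing shrinks, the determinant falls toward zero and the condition number grows, which matches the qualitative conclusion of this appendix.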